climateprediction.net (CPDN) home page
Thread 'Reporting - Errors while computing -'

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45195 - Posted: 29 Oct 2012, 9:38:52 UTC


Reporting - Errors while computing -

I would like to report the following - Errors while computing - in case it would give Andy and Jonathan some information they need:

Task 15408322

Task 15407533

Task 15407290

Task 15404774

Task 15402557

Task 15401805

all with the following Stderr file:

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Sorry, too many model crashes! :-(

Called boinc_finish

</stderr_txt>
]]>

I hope this helps,
Byron
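
The Stderr text above can be summarised mechanically. A minimal Python sketch, assuming the crash lines always follow the "Model crashed: <reason> tmp/pipe_dummy <number>" layout seen in this post; the function name `summarise_crashes` and the shortened sample text are illustrative only and are not part of BOINC or CPDN:

```python
import re
from collections import Counter

# Layout assumed from the Stderr above, e.g.
# "Model crashed: REPLANCA: PP HEADERS ... DO NOT MATCH tmp/pipe_dummy 2048"
CRASH_LINE = re.compile(
    r"^Model crashed:\s*(?P<reason>.+?)\s+tmp/pipe_dummy\s+\d+\s*$"
)


def summarise_crashes(stderr_text: str) -> Counter:
    """Count 'Model crashed' lines in a stderr dump, grouped by reason text."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = CRASH_LINE.match(line.strip())
        if match:
            counts[match.group("reason")] += 1
    return counts


# Shortened sample (the real report above contains six crash lines).
sample = """\
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Sorry, too many model crashes! :-(
"""

for reason, count in summarise_crashes(sample).items():
    print(f"{count} x {reason}")
```

Grouping by reason makes it easy to see whether every restart failed the same way, which is what the repeated lines in this report suggest.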


ID: 45195

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 45196 - Posted: 29 Oct 2012, 11:28:49 UTC - in response to Message 45195.  

Thanks, Byron. Reports of models failing with "REPLANCA" errors have been passed onto the project team and the cause is currently being investigated.
ID: 45196

DouglasRH
Joined: 21 Jan 09
Posts: 1
Credit: 617,665
RAC: 78
Message 45306 - Posted: 4 Dec 2012, 1:48:01 UTC

Hi,
I have BOINC with climateprediction.net running with no problems on my quad-core x86 machine (Vista 32-bit).

However, when I try to run it on my hex-core x64 machine (Windows 7 x64), all I get are computation errors: exit status 22 (0x16), communications deferred for an hour, and 'No work available to process'.

I've tried extensive computing preference changes including GPU on/off etc. Nothing works.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1255061
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15468768

Thanks for any and all assistance.

Regards,
DougRH

ID: 45306

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45307 - Posted: 4 Dec 2012, 2:44:08 UTC - in response to Message 45306.  

The programs here are 32-bit, and require 32-bit libraries on any 64-bit OS.

You're getting no new work from the project because there isn't any.
There's a thread in this section of the board called Project has no tasks available. Rambles a bit near the end, but ...

The Server Status page has a link in the blue menu to the left of here, 5 from the bottom.


Backups: Here
ID: 45307

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45479 - Posted: 19 Jan 2013, 2:29:20 UTC
Last modified: 19 Jan 2013, 2:40:07 UTC

-



approx. 2 hours ago I had the following - Full Resolution Ocean v6.07 Model Crash at 45%

so I would just like to report the following - Errors while computing

in case it might give Andy, Jonathan and the crew at Oxford ... some hints or information they might need ??

or is this crash the fault of me or my computer ??

hadcm3n_38o6_1940_40_008261906_1
Workunit 8417030
Created 9 Jan 2013 13:53:48 UTC
Sent 9 Jan 2013 13:54:16 UTC
Received 18 Jan 2013 23:43:06 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 22 (0x16)
Computer ID 1167855
Report deadline 10 Apr 2013 21:21:27 UTC

Run time 811,078.29
CPU time 645,155.60
Validate state Invalid
Claimed credit 0.00
Granted credit 5,598.72
application version UK Met Office Coupled Model Full Resolution Ocean v6.07

Stderr

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048

Sorry, too many model crashes! :-(

Called boinc_finish

</stderr_txt>
]]>


18/01/2013 3:41:45 PM | climateprediction.net | Computation for task hadcm3n_38o6_1940_40_008261906_1 finished
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_2.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_3.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_4.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
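
Event-log lines like the ones above can be scanned for missing uploads. A rough Python sketch, assuming the "Output file <name> for task <task> absent" wording shown here; `absent_outputs` is a hypothetical helper, not a BOINC tool:

```python
import re

# Wording assumed from the event-log lines quoted above.
ABSENT = re.compile(
    r"\|\s*Output file (?P<file>\S+) for task (?P<task>\S+) absent\s*$"
)


def absent_outputs(log_text: str) -> dict:
    """Map each task name to the list of output files reported as absent."""
    missing = {}
    for line in log_text.splitlines():
        match = ABSENT.search(line)
        if match:
            missing.setdefault(match.group("task"), []).append(match.group("file"))
    return missing


log = """\
18/01/2013 3:41:45 PM | climateprediction.net | Computation for task hadcm3n_38o6_1940_40_008261906_1 finished
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_2.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_3.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
"""

for task, files in absent_outputs(log).items():
    print(task, "is missing", len(files), "output file(s)")
```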


I hope this helps,
Byron
ID: 45479

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45480 - Posted: 19 Jan 2013, 3:00:53 UTC - in response to Message 45479.  

Hi Byron

Invalid Theta is when the physics goes wrong, and built in checks stop the model.
It's what the researchers are looking for, so that they know what the result is of starting a model with the values that it was given.

So, No Worries. :)


Backups: Here
ID: 45480

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45488 - Posted: 23 Jan 2013, 19:33:59 UTC - in response to Message 45480.  
Last modified: 23 Jan 2013, 19:37:50 UTC

-

Hi Byron

Invalid Theta is when the physics goes wrong, and built in checks stop the model.
It's what the researchers are looking for, so that they know what the result is of starting a model with the values that it was given.

So, No Worries.
:)


Hi Les

I have been away for a couple of days ... so just now reading your message.

aha ... ok thanks for that info and that explanation ... I understand now :)

it's good to hear that this might provide the project team with info that could be useful to them.

Byron
ID: 45488

old_user490835
Joined: 23 Dec 07
Posts: 3
Credit: 682,099
RAC: 0
Message 45548 - Posted: 12 Feb 2013, 20:00:48 UTC

I got a window telling me "library of c++ has crashed" etc., and there goes my 80 hours of computing :(

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15599675
Client state Compute error
Exit status 22 (0x16)

Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2096, iMonCtr=1
Model crash detected, will try to restart...
Sorry, too many model crashes! :-(
Called boinc_finish

Any ideas? Is my machine broken, or is it the model (WU)?
ID: 45548

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 45549 - Posted: 12 Feb 2013, 21:44:25 UTC - in response to Message 45548.  
Last modified: 12 Feb 2013, 21:49:05 UTC

Looking at the tasks page for your computer, GuruFin, there has been a great variety of reasons for models crashing on your computer. Possibly there is more than one issue.

For best results with climate models, which stress the CPU and memory more heavily than almost anything else, and which are fussy about disk access, the following are recommended, in this order:

1. Do not overclock.

2. Ensure that your virus scanner excludes the Boinc data folder and all sub-folders. (That is the folder with two sub-folders "projects" and "slots".)

3. In Boinc preferences - disk and memory usage, ensure that "leave applications in memory when suspended" is selected, and allow Boinc to use up to 75% of memory. (At least 1 GB per running task is best; mostly 500MB works too. Mostly.) Also ensure that Boinc has enough disk space, 2 GB per CPU at least.

4. Shut down Boinc (suspend all work) when you play games that have demanding video requirements.

5. For a multi-processor system such as yours, in processor usage preferences, set Boinc to "Use at most 100% of CPU time", and control the amount of work with for example "use at most 75 % of processors" (change the 75 to whatever you like).

If you have done these, and are still getting errors, your RAM may be running out of specification. Run a memory test program such as memtest86+ for at least 48 hours to check. Alternatively, the power supply for your computer may be unable to supply enough power, or the motherboard is using the not-recommended "voltage boost" feature that some have.

Edit: I should point out that there will still be some apparent failures even after doing all of this. Some climate models fail because they generate physically impossible atmospheric pressures or potential temperatures. A few others have been sent out with the wrong data files - these normally crash straight away, though.
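
The memory, disk and processor figures in points 3 and 5 can be turned into a quick back-of-envelope check. This is only a sketch of the arithmetic: the function names are made up, the defaults (75% of RAM, 1 GB per task, 75% of processors, 2 GB of disk per CPU) are the values suggested in this post, and BOINC itself decides what actually runs.

```python
def max_concurrent_tasks(total_ram_gb, cpu_count,
                         ram_fraction=0.75, gb_per_task=1.0,
                         cpu_fraction=0.75):
    """Rough upper bound on simultaneous models under the suggested limits."""
    # Tasks that fit in the share of RAM that BOINC is allowed to use.
    ram_limit = int((total_ram_gb * ram_fraction) // gb_per_task)
    # Tasks allowed by the "use at most N % of processors" preference.
    cpu_limit = int(cpu_count * cpu_fraction)
    return max(0, min(ram_limit, cpu_limit))


def min_disk_gb(cpu_count, gb_per_cpu=2.0):
    """Disk space to allow, using the 2 GB per CPU rule of thumb."""
    return cpu_count * gb_per_cpu


# Example with assumed hardware: 4 GB of RAM and 4 cores.
print(max_concurrent_tasks(total_ram_gb=4, cpu_count=4))
print(min_disk_gb(cpu_count=4))
```

On a machine with little RAM and many cores, the memory limit is the one that bites; with plenty of RAM, the processor percentage sets the ceiling.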
ID: 45549

old_user490835
Joined: 23 Dec 07
Posts: 3
Credit: 682,099
RAC: 0
Message 45550 - Posted: 12 Feb 2013, 22:13:08 UTC - in response to Message 45549.  
Last modified: 12 Feb 2013, 22:19:11 UTC

Looking at the tasks page for your computer, GuruFin, there has been a great variety of reasons for models crashing on your computer. Possibly there is more than one issue.

1. Do not overclock.

A little overclock? :) (4.2 GHz turbo frequency)


2. Ensure that your virus scanner excludes the Boinc data folder and all sub-folders. (That is the folder with two sub-folders "projects" and "slots".)

not interfering.. no problems


3. In Boinc preferences - disk and memory usage, ensure that "leave applications in memory when suspended" is selected, and allow Boinc to use up to 75% of memory. (At least 1 GB per running task is best; mostly 500MB works too. Mostly.) Also ensure that Boinc has enough disk space, 2 GB per CPU at least.

Already set this way or better (BOINC can use 8 GB of memory, with 16 GB installed).


4. Shut down Boinc (suspend all work) when you play games that have demanding video requirements.

yep.. done this way too


5. For a multi-processor system such as yours, in processor usage preferences, set Boinc to "Use at most 100% of CPU time", and control the amount of work with for example "use at most 75 % of processors" (change the 75 to whatever you like).

yep.. done this way too


If you have done these, and are still getting errors, your RAM may be running out of specification. Run a memory test program such as memtest86+ for at least 48 hours to check. Alternatively, the power supply for your computer may be unable to supply enough power, or the motherboard is using the not-recommended "voltage boost" feature that some have.

My memory modules are running at 1066 MHz (not 1333 MHz as the specs say)
because I like a stable PC. (memtest runs fine, and "Intel Burn Test" too.)


Edit: I should point out that there will still be some apparent failures even after doing all of this. Some climate models fail because they generate physically impossible atmospheric pressures or potential temperatures. A few others have been sent out with the wrong data files - these normally crash straight away, though.

That's why I asked for opinions on my failed tests :)

Maybe my issue is a corrupt "c++" .dll?

But what is curious is that my WUs run fine for 60-120 hours and then fail suddenly...

Thank you!
ID: 45550

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 45551 - Posted: 12 Feb 2013, 22:53:58 UTC - in response to Message 45550.  

Intel's turbo boost won't be a problem. According to Intel's literature it operates for up to two or three seconds when one process is using a lot of one core and the other cores are idle. In this situation the chip won't get too hot and unstable.

But if you are running more than one CPDN model at a time, turbo boost won't operate. And even with only one model, it will exceed the "a few seconds" time limit, so the CPU will cycle: a few seconds on turbo, then 10 or 20 seconds at normal speed, turbo for 2 or 3 seconds, back to normal... Interesting to watch, if you like that kind of thing.

Manually overclocking to the turbo boost frequency is not recommended. Together with underclocking the RAM, you may get just the results you are seeing.

Now I've given you the overclocking lecture. :-)

Several other people have reported the C++ DLL crash over the last few years (that I have seen).

Solving the problem was always difficult. Sometimes the problem was blamed on video drivers, but I can't remember whether ATI/AMD or Nvidia is the bigger suspect. Sometimes the screen saver was suspected, or other software such as Microsoft SQL Server, which will try to grab all the memory for itself. Sometimes a corrupt download of the BOINC software was suspected.

If you are confident in your video card, its drivers, and in the power supply, then the way forward is probably to disconnect from CPDN and all other projects, uninstall boinc, delete its data folder and program folder via windows explorer, download a fresh copy of boinc, and re-connect to CPDN.

But that may not work either. Some combinations of CPU, RAM, and motherboard just seem less reliable. I had a core i3 (Clarkdale) on a Gigabyte H55 board with Hynix memory that was like that. Worked perfectly for everything except CPDN.
ID: 45551

old_user490835
Joined: 23 Dec 07
Posts: 3
Credit: 682,099
RAC: 0
Message 45554 - Posted: 15 Feb 2013, 4:35:37 UTC - in response to Message 45551.  

Thanks for the good answers :)
I'll do some crunching and adjust my system to see if it's stable enough for CPDN :)

A month ago I had a different motherboard and processor (i2550k/p8p67), but now I have a new setup (i3770k/sabertooth), so let's see if that helps or not :)
ID: 45554

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45616 - Posted: 7 Mar 2013, 16:17:03 UTC
Last modified: 7 Mar 2013, 16:51:07 UTC

-

Hello everyone

I would like to report the following - Errors while computing - in case it might give the project team some information or clues they might need ?

maybe someone could pass the errors on to the project team and the cause could be investigated ?

the same Model crashed twice with the same <stderr_txt> file - with two different computers.

application version UK Met Office Coupled Model Full Resolution Ocean v6.07

name hadcm3n_zipn_2000_40_008323389

Workunit 8474524

application UK Met Office Coupled Model Full Resolution Ocean v6.07

created 2 Mar 2013 1:57:59 UTC

my computer 1167855

on this computer

I'm only running one project - only Climate Prediction.net - 24/7/365 - and no GPU - apps

8 physical CPU - no Hyper threading -


Stderr <core_client_version>7.0.28</core_client_version>

<![CDATA[

<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>

<stderr_txt>

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Sorry, too many model crashes! :-(

Called boinc_finish

</stderr_txt>
]]>


07/03/2013 12:13:18 AM | climateprediction.net | Computation for task hadcm3n_zipn_2000_40_008323389_1 finished

07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_1.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_2.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_3.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_4.zip for task hadcm3n_zipn_2000_40_008323389_1 absent

I'm just curious, what does the following <stderr_txt> file mean ?

<stderr_txt> file

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

</stderr_txt> file

I hope this helps,
Byron
ID: 45616

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45617 - Posted: 7 Mar 2013, 20:03:21 UTC - in response to Message 45616.  

'REPLANCA' etc. means that one of the supporting data files has the wrong number of entries, so it doesn't match what the main program is expecting.
Someone at the research place has gotten it wrong. :(

The project people were notified yesterday that the entire batch appears to be faulty.
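
The two crash messages discussed in this thread can be told apart with a simple lookup. A small Python sketch, assuming only the two markers that appear in the reports here; the table and the `explain_crash` helper are hypothetical, and the explanations are paraphrased from the replies in this thread rather than taken from any official list:

```python
# Hypothetical lookup table built from the explanations given in this thread.
KNOWN_CAUSES = {
    "INVALID THETA": "model physics went wrong and built-in checks stopped the model",
    "REPLANCA": "a supporting data file does not match what the main program expects",
}


def explain_crash(reason: str) -> str:
    """Return a short explanation for a 'Model crashed' reason string."""
    upper = reason.upper()
    for marker, explanation in KNOWN_CAUSES.items():
        if marker in upper:
            return explanation
    # Anything else is outside what this thread covers.
    return "not covered in this thread; report it on the forum"


print(explain_crash("REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH"))
print(explain_crash("ATM_DYN : INVALID THETA DETECTED."))
```

Neither of these two causes points at the volunteer's computer, which matches the answers given to Byron.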


Backups: Here
ID: 45617

old_user671679
Joined: 30 Jan 12
Posts: 38
Credit: 10,197,388
RAC: 0
Message 45619 - Posted: 7 Mar 2013, 20:25:37 UTC

Oh man, are you serious? Another bad batch? Maybe those boys need a vacation. I hope my computers don't get put on probation from these and the PNW wu's. Ah well, I guess we wait.

Hey Les, how long do you suppose it would take before they roll out the new Africa project?
ID: 45619

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45620 - Posted: 7 Mar 2013, 22:52:10 UTC - in response to Message 45619.  

Re: Africa project
No information, but wild guess: a year. Christmas present? :)

As for the bad batch: According to this page, there are a lot of research centres involved in the current hadcm3 work, so it could be a work experience person at any of them. Only the Uni of Oregon is involved with the PNW models, so someone there.

ANZ should be along soon.


Backups: Here
ID: 45620

old_user671679
Joined: 30 Jan 12
Posts: 38
Credit: 10,197,388
RAC: 0
Message 45621 - Posted: 7 Mar 2013, 23:31:49 UTC

I'm sorry Les, what is ANZ? I don't think I've heard of that one. I hope you don't mind me picking your brain for a moment: I am wondering about the Full Resolution Ocean models. What ocean are they modeling and what exactly are they looking for? I've read a lot of your posts and you seem very knowledgeable about this project.

Thanks in advance.
ID: 45621

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45622 - Posted: 8 Mar 2013, 0:53:35 UTC - in response to Message 45617.  

'REPLANCA' etc. means that one of the supporting data files has the wrong number of entries, so it doesn't match what the main program is expecting.
Someone at the research place has gotten it wrong. :(

The project people were notified yesterday that the entire batch appears to be faulty.

aha ... ok thank you Les.

Best Wishes
Byron
ID: 45622

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45623 - Posted: 8 Mar 2013, 1:38:27 UTC - in response to Message 45621.  

ANZ = Australia - New Zealand area, which is a big one.
It's been in beta for a while, and is now held up by problems at the researcher's end.
There's a thread a little way down this section about ANZ.


The Ocean is "all of it".
Several of the other model types use HadSM3 which has what's called a 'slab' ocean, i.e. it has certain fixed values, which makes modelling simpler and faster, when the aim is to study the atmosphere.

The Coupled Ocean model, HadCM3, has both an ocean part, (at lower resolution, because changes there are much slower), and an atmosphere part. This can be seen by watching the data part of the graphics display for a while.
The current use of this model is for the RAPID-RAPIT experiment. There is also a thread about it here by the researchers.

And then there are the 'regional' models.
These use a simplified model for the bulk of the globe, with a high resolution model for a small area. To run this high res model for the full globe would require a supercomputer to finish it in a reasonable time.
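
A rough count of grid columns shows why a high-resolution model is only run over a small area. The grid spacings and region size below are illustrative assumptions, not the actual CPDN model configurations, and `grid_columns` is a made-up helper:

```python
def grid_columns(lon_extent_deg, lat_extent_deg, lon_step_deg, lat_step_deg):
    """Number of grid columns needed to cover an area at a given spacing."""
    return round(lon_extent_deg / lon_step_deg) * round(lat_extent_deg / lat_step_deg)


# Assumed, illustrative numbers only.
coarse_global = grid_columns(360, 180, 3.75, 2.5)  # coarse grid, whole globe
fine_region = grid_columns(60, 40, 0.5, 0.5)       # fine grid, one region
fine_global = grid_columns(360, 180, 0.5, 0.5)     # fine grid, whole globe

print("coarse global columns:", coarse_global)
print("fine regional columns:", fine_region)
print("fine global columns:  ", fine_global)
print("fine global / fine regional:", fine_global // fine_region)
```

A finer grid usually also needs a shorter time step, so the real gap in computing cost is likely larger than the column count alone suggests.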



Backups: Here
ID: 45623

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 45624 - Posted: 8 Mar 2013, 11:24:17 UTC

Is the data still useful from those or will they be re-issued anyway when fixed?

Also even if they are not useful don't abort without checking the graphics - the last one I got didn't start in 2000 so was clearly not from that batch.
ID: 45624

©2024 cpdn.org