Thread 'WUs constantly failing'

old_user113466

Joined: 23 Nov 05
Posts: 18
Credit: 407,491
RAC: 0
Message 19989 - Posted: 6 Feb 2006, 1:49:20 UTC

I have yet to complete a sulphur model due to continual client errors.
Is it me, or the model?
Why do I get credit for a client error? If my host can't do it, then let's move on.

Sample message:
<core_client_version>5.2.13</core_client_version>
<message><file_xfer_error>
<file_name>sulphur_itus_100878500_1_1.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_itus_100878500_1_2.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_itus_100878500_1_3.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_itus_100878500_1_4.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_itus_100878500_1_5.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

</message>

Thanks for any help

DP

ID: 19989
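Editor's sketch: the error report pasted above is a run of `<file_xfer_error>` blocks, each with a file name, an error code and an (empty) message. A minimal Python sketch for pulling those fields out of pasted text follows. The function name `parse_xfer_errors` is hypothetical, and a regular expression is used rather than an XML parser because pasted fragments like this often lack a single root element.

```python
import re

# One <file_xfer_error> block, tolerant of whitespace or newlines between tags.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"<error_message>(?P<msg>[^<]*)</error_message>\s*"
    r"</file_xfer_error>"
)

def parse_xfer_errors(text):
    """Return a list of (file_name, error_code, error_message) tuples."""
    return [
        (m.group("name"), int(m.group("code")), m.group("msg"))
        for m in XFER_ERROR.finditer(text)
    ]

# Shortened copy of the sample posted above (two of the five blocks).
sample = """<core_client_version>5.2.13</core_client_version>
<message><file_xfer_error>
<file_name>sulphur_itus_100878500_1_1.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_itus_100878500_1_2.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
</message>"""

errors = parse_xfer_errors(sample)
for name, code, msg in errors:
    print(name, code, repr(msg))
```

This only tabulates what the client reported; it does not explain the cause of the -161 code.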
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19992 - Posted: 6 Feb 2006, 2:42:24 UTC

You get credits each time you trickle, as per the FAQ.

ID: 19992
old_user113466

Joined: 23 Nov 05
Posts: 18
Credit: 407,491
RAC: 0
Message 19995 - Posted: 6 Feb 2006, 4:17:12 UTC

Thanks for getting back.

Let me be more specific. Are my client errors wasting time on both sides, mine and CPDN's?
Do they feed any valuable information back into the scientific assumptions?
Is an error useful for shaping future thinking, or am I just getting an attaboy for my CPU time?

i.e.
2/5/2006 9:29:22 PM|climateprediction.net|Computation for result sulphur_hfa8_100812960_0 finished
2/5/2006 9:29:22 PM|Predictor @ Home|Resuming result h0017B_1_138865_1 using mfoldB125 version 428
2/5/2006 9:29:23 PM|climateprediction.net|Unrecoverable error for result sulphur_hfa8_100812960_0 (<file_xfer_error> <file_name>sulphur_hfa8_100812960_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
2/5/2006 9:42:16 PM||request_reschedule_cpus: process exited


DP
ID: 19995
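Editor's sketch: the client messages quoted above follow a `timestamp|project|message` layout, with an empty project field for messages from the client itself. The Python below splits such lines and filters by project. The helper name `parse_log_line` is hypothetical, and the month/day/year reading of the timestamp is an assumption based on the dates in this thread.

```python
from datetime import datetime

def parse_log_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the message text.
    stamp, project, message = line.split("|", 2)
    # Assumption: US-style month/day/year with a 12-hour clock.
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

lines = [
    "2/5/2006 9:29:22 PM|climateprediction.net|Computation for result sulphur_hfa8_100812960_0 finished",
    "2/5/2006 9:29:22 PM|Predictor @ Home|Resuming result h0017B_1_138865_1 using mfoldB125 version 428",
    "2/5/2006 9:42:16 PM||request_reschedule_cpus: process exited",
]

records = [parse_log_line(line) for line in lines]

# Keep only the messages that belong to climateprediction.net.
cpdn = [r for r in records if r[1] == "climateprediction.net"]
for when, project, message in cpdn:
    print(when.isoformat(), message)
```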
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19997 - Posted: 6 Feb 2006, 4:57:28 UTC

Trickles tell the server that the model is alive and has reached 'x,y,z' in the processing. At the end of each phase, a large zip file of data gets sent back; the first is about 8 MB, the rest about 2 MB each.
A run only becomes worthwhile once the end of the first phase is reached and its data is sent back. After that, ALL the end-of-phase zip files are needed for the run to be of further use.
So far, 2380 sulphur models have been completed, so it is possible.
The next part of the experiment will be different as regards the size of data on hard disks, when and how much data is returned, and the files left on the hard disk at the end of a model. But the run time will still be long.

The error messages are useful for debugging.
To some extent.
Mostly, it is long-time users such as myself who help out with this.
As has been posted MANY times, all over the help boards, the 161 error message tells us nothing. It's what's in yabsd.out (in the dataout folder of the model's folder) that often provides a clue.

When the two experiments due for imminent release are out of the way, the two programmers will be able to devote some time to looking into the rash of sulphur failures.

As your computers are constantly failing here at present, perhaps you should set them to 'No new work' for this project and concentrate on other projects for a few weeks. Look back now and then to see if there is something new, perhaps in the front page News section.

ID: 19997
Curtis

Joined: 16 Dec 05
Posts: 27
Credit: 242,905
RAC: 1,153
Message 20001 - Posted: 6 Feb 2006, 6:53:15 UTC

Yeah, I just got the same errors, but on a different model, I think:
sulphur_ghkh_000769265_0
Result id:1474958
ID: 20001
Curtis

Joined: 16 Dec 05
Posts: 27
Credit: 242,905
RAC: 1,153
Message 20002 - Posted: 6 Feb 2006, 7:01:30 UTC

Do you get the yabsd.out file, or do we need to send it somewhere somehow?
ID: 20002
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 20003 - Posted: 6 Feb 2006, 7:09:34 UTC

We don't have access to your computer, so you have to copy and paste the data here.
The last dozen or so lines should be enough to see what is happening.
Mostly, it will probably be: "Oh, right. Another one of those."
But you never know, it may be different.

When you say (in your previous post) "a different model", were you referring to a different model name from DP's?
If so, then you need to know that everyone gets a different data set and model name. There are no quorums here as used in SETI, etc.

ID: 20003
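Editor's sketch: one way to grab "the last dozen or so lines" of yabsd.out for pasting, written in Python. The helper name `last_lines` is hypothetical, and the demo runs on a throwaway file because the real path (the dataout folder inside the model's folder) differs on every machine.

```python
from collections import deque
from pathlib import Path
import tempfile

def last_lines(path, n=12):
    """Return the last n lines of a text file without holding the whole file in memory."""
    with open(path, "r", errors="replace") as handle:
        # A deque with maxlen=n keeps only the most recent n lines as the file is read.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]

# Demo on a temporary stand-in file with 100 numbered lines.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "yabsd.out"
    demo.write_text("\n".join(f"line {i}" for i in range(1, 101)) + "\n")
    tail = last_lines(demo)

print("\n".join(tail))
```

To use it on a real model, pass the full path to that model's yabsd.out in place of `demo`.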
old_user113466

Joined: 23 Nov 05
Posts: 18
Credit: 407,491
RAC: 0
Message 20012 - Posted: 7 Feb 2006, 1:36:40 UTC - in response to Message 19997.  

As your computers are constantly failing here ... Look back now and then to see if there is something new, perhaps in the front page News section.

Thanks, I'll be back.

DP
ID: 20012
Curtis

Joined: 16 Dec 05
Posts: 27
Credit: 242,905
RAC: 1,153
Message 20014 - Posted: 7 Feb 2006, 2:12:39 UTC

NOCNINDX Namelist is
$NOCNINDX
J_1 = 1
J_2 = 2
J_3 = 3
J_JMT = 73
J_JMTM1 = 72
J_JMTM2 = 71
J_JMTP1 = 74
JST = 1
JFIN = 73
J_FROM_LOC = 0
J_TO_LOC = 0
JMT_GLOBAL = 73
JMTM1_GLOBAL = 72
JMTM2_GLOBAL = 71
JMTP1_GLOBAL = 74
J_OFFSET = 0
O_MYPE = 0
O_EW_HALO = 0
O_NS_HALO = 0
J_PE_JSTM1 = -1
J_PE_JSTM2 = -1
J_PE_JFINP1 = -1
J_PE_JFINP2 = -1
O_NPROC = 1
IMOUT = 4*0
JMOUT = 4*0
J_PE_IND_MED = 4*0
NMEDLEV = 0
$END
SLAB TIMESTEP 2
im,sm,ngroup,new_im,new_sm 1 1 48 T F
FINAL TOTAL ENERGY = 0.45221E+27 J/
INITIAL TOTAL ENERGY = 0.45217E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.37262E+23 J/
FLUXES INTO ATM OVER DAY = 0.88673E+23 J/
ERROR IN ENERGY BUDGET = 0.51410E+23 J/
TEMP CORRECTION OVER DAY = 0.28450E-01 K
TEMPERATURE CORRECTION RATE = 0.32929E-06 K/S
FLUX CORRECTION (ATM) = 0.33312E+01 W/M2
FINAL ATM MASS = 0.17980E+22 KG
INITIAL ATM MASS = 0.17980E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 3 1 1 T F
NOCNINDX Namelist is
$NOCNINDX
J_1 = 1
J_2 = 2
J_3 = 3
J_JMT = 73
J_JMTM1 = 72
J_JMTM2 = 71
J_JMTP1 = 74
JST = 1
JFIN = 73
J_FROM_LOC = 0
J_TO_LOC = 0
JMT_GLOBAL = 73
JMTM1_GLOBAL = 72
JMTM2_GLOBAL = 71
JMTP1_GLOBAL = 74
J_OFFSET = 0
O_MYPE = 0
O_EW_HALO = 0
O_NS_HALO = 0
J_PE_JSTM1 = -1
J_PE_JSTM2 = -1
J_PE_JFINP1 = -1
J_PE_JFINP2 = -1
O_NPROC = 1
IMOUT = 4*0
JMOUT = 4*0
J_PE_IND_MED = 4*0
NMEDLEV = 0
$END
SLAB TIMESTEP 3
3395537 words long
MODEL DUMP SUCCESSFULLY WRITTEN - 3434914 WORDS TO UNIT 22

Number of Words Written to Disk was 3436498
im,sm,ngroup,new_im,new_sm 1 1 48 T F
FINAL TOTAL ENERGY = 0.45222E+27 J/
INITIAL TOTAL ENERGY = 0.45221E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.15717E+23 J/
FLUXES INTO ATM OVER DAY = 0.67759E+23 J/
ERROR IN ENERGY BUDGET = 0.52042E+23 J/
TEMP CORRECTION OVER DAY = 0.28800E-01 K
TEMPERATURE CORRECTION RATE = 0.33333E-06 K/S
FLUX CORRECTION (ATM) = 0.33722E+01 W/M2
FINAL ATM MASS = 0.17980E+22 KG
INITIAL ATM MASS = 0.17980E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 3 1 1 T F
NOCNINDX Namelist is
$NOCNINDX
J_1 = 1
J_2 = 2
J_3 = 3
J_JMT = 73
J_JMTM1 = 72
J_JMTM2 = 71
J_JMTP1 = 74
JST = 1
JFIN = 73
J_FROM_LOC = 0
J_TO_LOC = 0
JMT_GLOBAL = 73
JMTM1_GLOBAL = 72
JMTM2_GLOBAL = 71
JMTP1_GLOBAL = 74
J_OFFSET = 0
O_MYPE = 0
O_EW_HALO = 0
O_NS_HALO = 0
J_PE_JSTM1 = -1
J_PE_JSTM2 = -1
J_PE_JFINP1 = -1
J_PE_JFINP2 = -1
O_NPROC = 1
IMOUT = 4*0
JMOUT = 4*0
J_PE_IND_MED = 4*0
NMEDLEV = 0
$END
SLAB TIMESTEP 4
im,sm,ngroup,new_im,new_sm 1 1 48 T F

ID: 20014
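Editor's sketch: the energy-budget lines in the dump above can be read as `LABEL = value` pairs. Judging only from the printed numbers (not from any model documentation), ERROR IN ENERGY BUDGET looks like FLUXES INTO ATM OVER DAY minus CHG IN TOTAL ENERGY OVER DAY, and TEMPERATURE CORRECTION RATE looks like TEMP CORRECTION OVER DAY divided by 86400 seconds. The Python below parses the first block and checks both relations to within rounding. `parse_budget` is a hypothetical helper name.

```python
import math
import re

# Upper-case label, '=', then a Fortran-style number such as 0.45221E+27.
FIELD = re.compile(r"^\s*([A-Z ()]+?)\s*=\s*([-+]?\d*\.\d+E[-+]\d+)")

def parse_budget(block):
    """Map each 'LABEL = value' line to a float; other lines are ignored."""
    values = {}
    for line in block.splitlines():
        match = FIELD.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values

# First energy-budget block from the yabsd.out excerpt above.
block = """FINAL TOTAL ENERGY = 0.45221E+27 J/
INITIAL TOTAL ENERGY = 0.45217E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.37262E+23 J/
FLUXES INTO ATM OVER DAY = 0.88673E+23 J/
ERROR IN ENERGY BUDGET = 0.51410E+23 J/
TEMP CORRECTION OVER DAY = 0.28450E-01 K
TEMPERATURE CORRECTION RATE = 0.32929E-06 K/S
FLUX CORRECTION (ATM) = 0.33312E+01 W/M2"""

b = parse_budget(block)

# Inferred relation 1: error = fluxes in - change in total energy.
budget_ok = math.isclose(
    b["FLUXES INTO ATM OVER DAY"] - b["CHG IN TOTAL ENERGY OVER DAY"],
    b["ERROR IN ENERGY BUDGET"],
    rel_tol=1e-3,
)

# Inferred relation 2: daily temperature correction spread over 86400 seconds.
rate_ok = math.isclose(
    b["TEMP CORRECTION OVER DAY"] / 86400.0,
    b["TEMPERATURE CORRECTION RATE"],
    rel_tol=1e-3,
)

print("budget consistent:", budget_ok)
print("rate consistent:", rate_ok)
```

Both checks pass on this block, which suggests the excerpt shows the model running normally at these timesteps rather than the point of failure.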
KeeperC

Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 20098 - Posted: 10 Feb 2006, 14:09:19 UTC
Last modified: 10 Feb 2006, 14:11:21 UTC

This result (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1612048) and this one (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1351239), both on the same machine, failed at exactly the same point. The machine cannot get past this point in sulphur, despite having many successful slab models to its credit. Two of my other machines have also failed on sulphur, though less repeatably. I must say that I find this problem quite frustrating. I know the team is focused on the new experiments, but if this undiagnosed problem persists with the coupled model, it will begin to sap my (considerable) commitment to this project. :(

Edit: Sorry, can't remember how to put in links, but you have the URLs at least.
ID: 20098
old_user31578

Joined: 28 Nov 04
Posts: 9
Credit: 687,368
RAC: 0
Message 20129 - Posted: 11 Feb 2006, 11:49:17 UTC

I have similar problems with the sulphur models. One example is:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1753881


ID: 20129
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 20130 - Posted: 11 Feb 2006, 13:55:43 UTC
Last modified: 11 Feb 2006, 14:16:37 UTC

@ Egon and KeeperC
These generic error messages were not reported during the Coupled Model tests.
Hopefully, you'll be able to run this new model without problems.
Arnaud
ID: 20130
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 20131 - Posted: 11 Feb 2006, 14:12:18 UTC

Egon,

The problem is with Linux sulphur 4.23. See this sticky in the "BOINC Questions and Problems" Linux forum if you haven't already.
ID: 20131
old_user31578

Joined: 28 Nov 04
Posts: 9
Credit: 687,368
RAC: 0
Message 20132 - Posted: 11 Feb 2006, 17:51:08 UTC

Thanks for the info, geophi. I will crunch some other BOINC projects until the next experiment goes live.
/Egon
ID: 20132