Message boards : Number crunching : WUs constantly failing
Author | Message |
---|---|
Send message Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0 |
I have yet to complete a sulphur model due to continual client errors. Is it me, or the model? Why do I get credit for a client error? If my host can't do it, then let's move on. Sample message: <core_client_version>5.2.13</core_client_version> <message><file_xfer_error> <file_name>sulphur_itus_100878500_1_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_itus_100878500_1_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_itus_100878500_1_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_itus_100878500_1_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_itus_100878500_1_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> </message> Thanks for any help. DP |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
You get credits each time you trickle, as per the FAQ. |
Send message Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0 |
Thanks for getting back. Let me be more specific. Are my client errors wasting time on both sides, mine and CPDN's? Do they convey valuable info back to the scientific assumptions? Is an error useful for shaping future thinking, or am I just getting an atta-boy back for my CPU time? For example: 2/5/2006 9:29:22 PM|climateprediction.net|Computation for result sulphur_hfa8_100812960_0 finished 2/5/2006 9:29:22 PM|Predictor @ Home|Resuming result h0017B_1_138865_1 using mfoldB125 version 428 2/5/2006 9:29:23 PM|climateprediction.net|Unrecoverable error for result sulphur_hfa8_100812960_0 (<file_xfer_error> <file_name>sulphur_hfa8_100812960_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>) 2/5/2006 9:42:16 PM||request_reschedule_cpus: process exited DP |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Trickles are to tell the server that the model is alive, and is up to 'x,y,z' of the processing. At the end of each phase, a large zip file of data gets sent back; the first is about 8 MB, the rest about 2 MB. A run only becomes worthwhile if the end of the first phase is reached and the data sent back. After this, ALL end-of-phase zip files are needed for it to be further worthwhile. At the moment, there have been 2380 sulphur models completed, so it is possible. The next part of the experiment will be different as regards the size of data on hard drives, when and how much data is returned, and the files left on the hard drive at the end of a model. But the run time will still be long. The error messages are useful for debugging, to some extent. Mostly, it is long-time users such as myself who help out with this. As has been posted MANY times, all over the help boards, the 161 error message tells us nothing. It's what's in yabsd.out (in the dataout folder of the model's folder) that often provides a clue. When the two experiments due for imminent release are out of the way, the two programmers will be able to devote some time to looking into the rash of sulphur failures. As your computers are constantly failing here at present, perhaps you should set them for 'No new work' from here, and concentrate on other projects for a few weeks. Look back now and then to see if there is something new, perhaps in the front page News section. |
Send message Joined: 16 Dec 05 Posts: 27 Credit: 242,905 RAC: 1,153 |
Ya. I just got the same errors, but on a different model, I think: sulphur_ghkh_000769265_0 (Result id: 1474958) |
Send message Joined: 16 Dec 05 Posts: 27 Credit: 242,905 RAC: 1,153 |
Do you get the yabsd.out file, or do we need to send it somewhere somehow? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
We don't have access to your computer, so you have to copy and paste the data here. The last dozen or so lines should be enough to see what is happening. Mostly, it will probably be: "Oh, right. Another one of those." But you never know, it may be different. When you say (in your previous post) "a different model", were you referring to a different model name from dp's? If so, then you need to know that everyone gets a different data set and model name. There are no quorums here as used in SETI, etc. |
Send message Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0 |
"As your computers are constantly failing here ... Look back now and then to see if there is something new, perhaps in the front page News section." Thanks, I'll be back. DP |
Send message Joined: 16 Dec 05 Posts: 27 Credit: 242,905 RAC: 1,153 |
NOCNINDX Namelist is $NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 4*0 JMOUT = 4*0 J_PE_IND_MED = 4*0 NMEDLEV = 0 $END SLAB TIMESTEP 2 im,sm,ngroup,new_im,new_sm 1 1 48 T F FINAL TOTAL ENERGY = 0.45221E+27 J/ INITIAL TOTAL ENERGY = 0.45217E+27 J/ CHG IN TOTAL ENERGY OVER DAY = 0.37262E+23 J/ FLUXES INTO ATM OVER DAY = 0.88673E+23 J/ ERROR IN ENERGY BUDGET = 0.51410E+23 J/ TEMP CORRECTION OVER DAY = 0.28450E-01 K TEMPERATURE CORRECTION RATE = 0.32929E-06 K/S FLUX CORRECTION (ATM) = 0.33312E+01 W/M2 FINAL ATM MASS = 0.17980E+22 KG INITIAL ATM MASS = 0.17980E+22 KG CORRECTION FACTOR FOR PSTAR = 0.10000E+01 im,sm,ngroup,new_im,new_sm 3 1 1 T F NOCNINDX Namelist is $NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 4*0 JMOUT = 4*0 J_PE_IND_MED = 4*0 NMEDLEV = 0 $END SLAB TIMESTEP 3 3395537 words long MODEL DUMP SUCCESSFULLY WRITTEN - 3434914 WORDS TO UNIT 22 Number of Words Written to Disk was 3436498 im,sm,ngroup,new_im,new_sm 1 1 48 T F FINAL TOTAL ENERGY = 0.45222E+27 J/ INITIAL TOTAL ENERGY = 0.45221E+27 J/ CHG IN TOTAL ENERGY OVER DAY = 0.15717E+23 J/ FLUXES INTO ATM OVER DAY = 0.67759E+23 J/ ERROR IN ENERGY BUDGET = 0.52042E+23 J/ TEMP CORRECTION OVER DAY = 0.28800E-01 K TEMPERATURE CORRECTION RATE = 0.33333E-06 K/S FLUX CORRECTION (ATM) = 0.33722E+01 W/M2 FINAL ATM MASS = 0.17980E+22 KG INITIAL ATM MASS = 0.17980E+22 KG CORRECTION FACTOR FOR PSTAR = 0.10000E+01 im,sm,ngroup,new_im,new_sm 3 1 1 T 
F NOCNINDX Namelist is $NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 4*0 JMOUT = 4*0 J_PE_IND_MED = 4*0 NMEDLEV = 0 $END SLAB TIMESTEP 4 im,sm,ngroup,new_im,new_sm 1 1 48 T F |
Send message Joined: 5 Aug 04 Posts: 66 Credit: 2,146,056 RAC: 0 |
This result (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1612048) and this one (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1351239), both on the same machine, failed at exactly the same point. The machine cannot get past this point in sulphur, despite having many successful slab models to its credit. Two of my other machines have also failed on sulphur, though less repeatably. I must say that I find this problem quite frustrating. I know the team is focused on the new experiments, but if this undiagnosed problem persists with the coupled model, it will begin to sap my (considerable) commitment to this project. :( Edit: Sorry, can't remember how to put in links, but you have the URLs at least. |
Send message Joined: 28 Nov 04 Posts: 9 Credit: 687,368 RAC: 0 |
I have similar problems with the sulphur models, one example is: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1753881 |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
@ Egon and KeeperC: These generic error messages have not been reported during the Coupled Model tests. Hopefully, you'll be able to run this new model without problems. Arnaud |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Egon, the problem is with Linux sulphur 4.23. See this sticky in the "BOINC Questions and Problems" Linux forum if you haven't already. |
Send message Joined: 28 Nov 04 Posts: 9 Credit: 687,368 RAC: 0 |
Thanks for the info, geophi. I will crunch some other BOINC projects until the next experiment goes live. /Egon |
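
For anyone following the advice in this thread to post the last dozen or so lines of yabsd.out, here is a minimal Python sketch of the two chores involved: listing the files named in the `<file_xfer_error>` blocks quoted above, and pulling the tail of yabsd.out for pasting. It is not a CPDN or BOINC tool. The function names `parse_xfer_errors` and `tail_lines` are made up for illustration, and the `dataout/yabsd.out` path is only a placeholder, since the real location depends on where your model's folder lives.

```python
import re
from collections import deque
from pathlib import Path

# Matches one <file_xfer_error> block in the shape quoted in this thread:
# a <file_name> element followed by an <error_code> element.
_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]*)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def parse_xfer_errors(text):
    """Return (file_name, error_code) pairs found in a pasted message."""
    return [
        (match.group("name"), int(match.group("code")))
        for match in _XFER_ERROR.finditer(text)
    ]


def tail_lines(path, count=12):
    """Return the last `count` lines of a text file (a dozen by default)."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines as it reads.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    # Two of the error blocks from the first post, copied as-is.
    sample = (
        "<file_xfer_error> <file_name>sulphur_itus_100878500_1_1.zip</file_name> "
        "<error_code>-161</error_code> <error_message></error_message> "
        "</file_xfer_error> "
        "<file_xfer_error> <file_name>sulphur_itus_100878500_1_2.zip</file_name> "
        "<error_code>-161</error_code> <error_message></error_message> "
        "</file_xfer_error>"
    )
    for name, code in parse_xfer_errors(sample):
        print(name, code)

    # Placeholder path (assumption): adjust to your own model's folder.
    out_file = Path("dataout") / "yabsd.out"
    if out_file.exists():
        print("\n".join(tail_lines(out_file)))
```

As noted earlier in the thread, the error code alone says little, so the tail of yabsd.out is the part worth pasting into a post.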