climateprediction.net (CPDN) home page
Thread 'Error while computing???'

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59421 - Posted: 13 Jan 2019, 11:00:27 UTC - in response to Message 59417.  

Is it possible to change this exit_disc_limit value? It's not in cc_config.xml.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,007,330
RAC: 21,449
Message 59422 - Posted: 13 Jan 2019, 11:40:49 UTC - in response to Message 59421.  

Is it possible to change this exit_disc_limit value? It's not in cc_config.xml.

I have vague memories of a discussion about whether this was hard-wired into the BOINC code or was in one of the files downloaded for each task. In either case I suspect the answer is: certainly not easily. I guess in the case of the former, you could look for the value in the code and roll your own. In the case of the latter, I wouldn't even know where to start.

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59423 - Posted: 13 Jan 2019, 18:07:22 UTC - in response to Message 59422.  

I'm not that good!

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59424 - Posted: 13 Jan 2019, 18:30:39 UTC

It's one of several values placed into a file, before it and lots of others are bundled up and placed in the download queue.
And it's not a short, simple number, so don't even think about it.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59425 - Posted: 13 Jan 2019, 19:12:17 UTC

My latest one is done:

Name hadcm3s_x5300_190012_60_771_011668342_2
Workunit 11668342

But, not surprisingly, it will not upload...

Sun 13 Jan 2019 10:54:31 AM EST | climateprediction.net | Computation for task hadcm3s_x5300_190012_60_771_011668342_2 finished
Sun 13 Jan 2019 10:54:31 AM EST | Rosetta@home | Resuming task rb_01_11_87945_129797__t000__2_C1_SAVE_ALL_OUT_IGNORE_THE_REST_711816_384_0 using minirosetta version 378 in slot 2
Sun 13 Jan 2019 10:54:36 AM EST | climateprediction.net | Started upload of hadcm3s_x5300_190012_60_771_011668342_2_r376222488_out.zip
Sun 13 Jan 2019 10:54:38 AM EST | | Project communication failed: attempting access to reference site
Sun 13 Jan 2019 10:54:38 AM EST | climateprediction.net | Temporarily failed upload of hadcm3s_x5300_190012_60_771_011668342_2_r376222488_out.zip: connect() failed
Sun 13 Jan 2019 10:54:38 AM EST | climateprediction.net | Backing off 00:03:26 on upload of hadcm3s_x5300_190012_60_771_011668342_2_r376222488_out.zip

There are the 5 regular .zip files and also out.zip and restart.zip.

I assume they will upload eventually.
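
For anyone who wants to pull the upload events out of a longer event log, here is a minimal Python sketch. It assumes every line has the same "timestamp | project | message" layout as the excerpt above, and it only recognises the three upload messages quoted there. The names parse_upload_events and backoff_seconds are made up for this illustration; they are not part of BOINC.

```python
import re

# Matches the three upload messages seen in the excerpt above.
# Assumed layout: "<timestamp> | <project> | <message>".
UPLOAD_RE = re.compile(
    r"^(?P<when>[^|]+)\|(?P<project>[^|]*)\|\s*"
    r"(?:(?P<started>Started upload of)"
    r"|(?P<failed>Temporarily failed upload of)"
    r"|Backing off (?P<delay>\d+:\d{2}:\d{2}) on upload of)"
    r"\s+(?P<file>[^\s:]+)(?::\s*(?P<reason>.*))?$"
)


def backoff_seconds(hhmmss):
    """Convert a back-off delay such as '00:03:26' to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_upload_events(lines):
    """Return one dict per upload-related log line; other lines are skipped."""
    events = []
    for line in lines:
        match = UPLOAD_RE.match(line.strip())
        if match is None:
            continue
        if match.group("started"):
            kind = "started"
        elif match.group("failed"):
            kind = "failed"
        else:
            kind = "backoff"
        delay = match.group("delay")
        events.append({
            "when": match.group("when").strip(),
            "project": match.group("project").strip(),
            "kind": kind,
            "file": match.group("file"),
            "reason": match.group("reason"),  # None unless the line gave one
            "backoff_s": backoff_seconds(delay) if delay else None,
        })
    return events
```

Fed the six lines above, this should skip the task and "Project communication failed" lines and return the started, failed and back-off events for out.zip.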

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59451 - Posted: 16 Jan 2019, 22:53:44 UTC - in response to Message 59387.  

Same error from batch 781

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xadae.pipe_dummy
Leaving CPDN_ain::Monitor...
02:23:23 (9916): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_safr50_n0r8_198912_14_781_011715612_0_r723054499_14.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
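
If you have several failed tasks and want to list the affected files, a small Python sketch like the following can pull the file name and error code out of the stderr text. It assumes the fragment looks exactly like the one pasted above (a file_name element followed by an error_code element whose text is a number and a bracketed description). The function name find_upload_errors is invented for this example.

```python
import re

# Assumed shape, taken from the fragment above:
#   <file_xfer_error>
#   <file_name>NAME</file_name>
#   <error_code>NUMBER (DESCRIPTION)</error_code>
XFER_ERROR_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<text>[^<]*)\)</error_code>"
)


def find_upload_errors(stderr_text):
    """Return (file_name, error_code, description) for each upload failure."""
    return [
        (m.group("name"), int(m.group("code")), m.group("text"))
        for m in XFER_ERROR_RE.finditer(stderr_text)
    ]
```

A regular expression is used rather than an XML parser because the pasted output is only a fragment and would not parse as a complete document.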

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59452 - Posted: 17 Jan 2019, 23:02:06 UTC - in response to Message 59451.  

And another one

upload failure: <file_xfer_error>
<file_name>wah2_safr50_n3e5_199912_14_781_011719029_0_r2014729666_14.zip</file_name>
<error_code>-240 (stat() failed)</error_code>

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59453 - Posted: 18 Jan 2019, 1:35:04 UTC

Note to all:

All batch 781 tasks can be aborted.

KWSN Sir Clark
Joined: 8 Jul 05
Posts: 33
Credit: 1,274,211
RAC: 0
Message 59463 - Posted: 18 Jan 2019, 15:59:56 UTC - in response to Message 59453.  

Note to all:

All batch 781 tasks can be aborted.


Could this be sent to Boinc Managers automatically by the server?

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 59464 - Posted: 18 Jan 2019, 19:50:45 UTC - in response to Message 59463.  

Note to all:

All batch 781 tasks can be aborted.


Could this be sent to Boinc Managers automatically by the server?


There used to be something called the “killer trickle” which did that. I’ll ask if that is required and still possible.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59465 - Posted: 18 Jan 2019, 20:01:47 UTC

I noticed a message elsewhere, about the researcher finding the cause of the problems, and closing the batch.

Telling regular posters here not to waste more time on that batch is one thing, but getting through to all of the set-and-forget users, who may never look at their Manager, is another. And then there are those who have the messages turned off.

So I'm not going to bother.

PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,466,907
RAC: 90,404
Message 59466 - Posted: 18 Jan 2019, 20:21:27 UTC - in response to Message 59465.  

Why don't they cancel the batch from the server and cancel the units from the machines?

KWSN Sir Clark
Joined: 8 Jul 05
Posts: 33
Credit: 1,274,211
RAC: 0
Message 59467 - Posted: 19 Jan 2019, 0:17:35 UTC - in response to Message 59466.  

Why don't they cancel the batch from the server and cancel the units from the machines?

Yep. There's a facility to do this. Other projects like WCG do it if a WU is not needed due to a quorum being met by a late-returning WU. I'm not sure whether it works on tasks that are already in progress on a host.

Hopefully the project folk here can sort it.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59468 - Posted: 19 Jan 2019, 1:23:41 UTC

The project people can do all these things, when they're not away and it's not a weekend.

I just saw a message on one of our private boards about it, so I thought that I'd give people advance notice.
Those that don't want to delete them don't have to.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,007,330
RAC: 21,449
Message 59469 - Posted: 19 Jan 2019, 8:32:37 UTC - in response to Message 59468.  

Those that don't want to delete them don't have to.


And assuming they run till the end before producing an error, credit will still be granted for the trickle-up messages.

(I have however deleted mine.)

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59472 - Posted: 20 Jan 2019, 11:38:27 UTC

Got segment violation errors on tasks from batches 777 and 780. Both appear to occur after the 9th zip file, as zips from 10 onwards are not generated.

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 59473 - Posted: 20 Jan 2019, 14:44:11 UTC

Well, it's not a huge success... I have completed 2 w/u since rejoining. Both have crashed with
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Both at the end of their runs: a total of 10.5 days of processing/science wasted. Not a big deal in the world of climate prediction, but not very encouraging to get more work. Time to wander off for a while, I think.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,007,330
RAC: 21,449
Message 59474 - Posted: 20 Jan 2019, 15:17:46 UTC - in response to Message 59473.  

Well, it's not a huge success... I have completed 2 w/u since rejoining. Both have crashed with
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Both at the end of their runs: a total of 10.5 days of processing/science wasted.

Batch 781 is the one where we have just (a little belatedly) been told we can abort. When the fixed files have been added, this batch will be re-released with a new batch number, possibly some time next week.

Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 59475 - Posted: 20 Jan 2019, 20:41:49 UTC

After aborting the 781's I received three more of them.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59476 - Posted: 20 Jan 2019, 21:32:24 UTC

I hope that you got rid of them as well, because ALL of batch 781 is missing a month or two of data from the end of one of its files.

Perhaps the new instructions should be:
1) Set project to: No New Tasks
2) Abort the faulty models
3) Wait
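
For those who prefer the command line to the Manager, steps 1 and 2 could be scripted roughly as below. This is only a sketch under two assumptions: that the batch number is the third field from the end of the task name (as in wah2_safr50_n0r8_198912_14_781_011715612_0 earlier in this thread), and that the project URL you pass in is the one your own client reports for CPDN. The functions batch_of and abort_commands are hypothetical helpers; the boinccmd operations they build ("--project URL nomorework" and "--task URL task_name abort") are standard boinccmd commands. The code only builds the command lines and does not run anything.

```python
def batch_of(task_name):
    """Return the batch number from a task name, or None if it does not fit.

    Assumes names end in ..._<batch>_<workunit>_<n>, as seen in this thread.
    """
    parts = task_name.split("_")
    if len(parts) < 3 or not parts[-3].isdigit():
        return None
    return int(parts[-3])


def abort_commands(project_url, task_names, bad_batch):
    """Build boinccmd argument lists: No New Tasks first, then the aborts."""
    commands = [["boinccmd", "--project", project_url, "nomorework"]]
    for name in task_names:
        if batch_of(name) == bad_batch:
            commands.append(["boinccmd", "--task", project_url, name, "abort"])
    return commands
```

Print the resulting commands and check them before running any of them, for example with subprocess.run, since an abort cannot be undone.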