Thread 'Error while computing???'

Message boards : Number crunching : Error while computing???
ProfileDave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,006,502
RAC: 21,456
Message 59243 - Posted: 26 Dec 2018, 13:41:27 UTC - in response to Message 59242.  

On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go.


In contrast, I have one 65% complete in just over 5 days with 2 days estimated to complete. However it is a retread so may not make it.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59244 - Posted: 26 Dec 2018, 14:58:38 UTC - in response to Message 59243.  

On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go.

In contrast, I have one 65% complete in just over 5 days with 2 days estimated to complete. However it is a retread so may not make it.


Well, mine is a double retread: two attempts by others have failed before it was issued to me.

Currently about 24% complete, 58 hours CPU time done, 282 hours to go. It sure is not failing in 30 to 60 seconds of CPU time like the two failures before me.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59245 - Posted: 26 Dec 2018, 15:10:59 UTC

My last failure was over a year ago. In my opinion, barring a hardware error, the only reason one gets a segmentation violation in a Linux system is if there is an error in the program.

And if my machine were getting segmentation violations in this one program, it would get them in other programs too. I have some programs that run 24/7 starting at boot up. Surely they would have problems too, and they don't.

Name wah2_sas50_l09y_198612_13_617_011131907_1
Workunit 11131907
Created 28 Jul 2017, 16:02:10 UTC
Sent 28 Jul 2017, 16:02:19 UTC
Report deadline 10 Jul 2018, 21:22:19 UTC
Received 29 Jul 2017, 20:11:45 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1256552
Run time 13 hours 52 min 54 sec
CPU time 12 hours 39 min 56 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 1.28 GFLOPS
Application version Weather At Home 2 (wah2) v8.25
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
SIGSEGV: segmentation violation
[snip]
ProfileDave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,006,502
RAC: 21,456
Message 59246 - Posted: 26 Dec 2018, 15:19:33 UTC

Well, mine is a double retread: two attempts by others have failed before it was issued to me.


Mine is also a double retread. The failures came after two and seven days, the seven-day machine being considerably faster than my desktop. That machine has a higher failure rate than this desktop, but not a massive one, so this task may well be destined to fail. Judgement may get easier at some point next year because I understand there is a planned rebuild of the hadcm3s model which should resolve the problem of trickles not showing, though as with all these things, I am not holding my breath!

https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11669984
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59247 - Posted: 27 Dec 2018, 0:11:59 UTC

These are the times for 3 of my "shorts" way back in April 2017:

hadcm3s_81bx_201412_120_564_011004032_1

Run time 3 days 22 hours 3 min 50 sec
CPU time 3 days 21 hours 22 min 23 sec


hadcm3s_82nj_201412_120_564_011005746_1

Run time 3 days 22 hours 44 min 36 sec
CPU time 3 days 22 hours 2 min 58 sec


hadcm3s_82lx_201412_120_564_011005688_1

Run time 3 days 18 hours 46 min 8 sec
CPU time 3 days 18 hours 3 min 37 sec
Profilegeophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 59248 - Posted: 27 Dec 2018, 0:44:48 UTC - in response to Message 59242.  

What I find interesting about this work unit

hadcm3s_st249_190012_120_771_011667216

is the large amount of Run Time required (149,672.31, 138,538.43 seconds) to get 30 to 60 seconds of CPU time. This is on two different machines with different CPUs, both running 64-bit Windows 10. What are they spending that time on without using a CPU?

On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go.

It has something to do with error conditions. Obviously a lot more CPU time is used before failure on at least some of these hadcm3s models. All the ones that failed on one of my Linux PCs did so well after the first trickle, and the first trickle took 50,000+ seconds CPU time. It seems that when they fail in a certain way, the CPU time gets reset somehow.
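
A rough way to spot results whose CPU time looks "reset" is to compare the reported CPU time with the reported run time. The sketch below uses the two sets of figures quoted for this work unit; the function names and the 0.5 threshold are assumptions made for illustration, not anything BOINC or CPDN defines.

```python
# Run time and CPU time in seconds, keyed by computer ID, as reported
# for workunit 11667216 in this thread.
REPORTED = {
    1425854: (149_672.31, 30.61),
    1468717: (138_538.43, 61.14),
}


def cpu_fraction(run_time_s: float, cpu_time_s: float) -> float:
    """Return the share of wall-clock run time that was counted as CPU time."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_s / run_time_s


def looks_reset(run_time_s: float, cpu_time_s: float, threshold: float = 0.5) -> bool:
    """Flag a result whose CPU time is implausibly small for its run time.

    The threshold is an arbitrary assumption; a healthy task in this
    thread shows CPU time only slightly below run time.
    """
    return cpu_fraction(run_time_s, cpu_time_s) < threshold


for computer, (run_s, cpu_s) in REPORTED.items():
    print(computer, f"{cpu_fraction(run_s, cpu_s):.6f}", looks_reset(run_s, cpu_s))
```

Both results above are flagged, which is consistent with the idea that the CPU counter was lost when the model failed rather than the machine sitting idle for a day and a half.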
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59249 - Posted: 27 Dec 2018, 4:25:01 UTC - in response to Message 59247.  



These are the times for 3 of my "shorts" way back in April 2017:

hadcm3s_81bx_201412_120_564_011004032_1

Run time 3 days 22 hours 3 min 50 sec
CPU time 3 days 21 hours 22 min 23 sec


hadcm3s_82nj_201412_120_564_011005746_1

Run time 3 days 22 hours 44 min 36 sec
CPU time 3 days 22 hours 2 min 58 sec


hadcm3s_82lx_201412_120_564_011005688_1

Run time 3 days 18 hours 46 min 8 sec
CPU time 3 days 18 hours 3 min 37 sec


These seem to be normal, whether they succeeded or not. By normal, I mean that the Run time was only slightly more than the CPU time.
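
To put a number on "only slightly more", the durations shown on the task pages can be converted to seconds and compared. This is a minimal sketch that assumes the "N days N hours N min N sec" layout seen in this thread; `parse_duration` is a hypothetical helper, not part of BOINC.

```python
import re

# Seconds per unit, for the duration layout used on the task pages.
_UNIT_SECONDS = {"day": 86_400, "hour": 3_600, "min": 60, "sec": 1}
_PART = re.compile(r"(\d+)\s+(day|hour|min|sec)s?\b")


def parse_duration(text: str) -> int:
    """Convert e.g. '3 days 22 hours 3 min 50 sec' to whole seconds."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


# Run time and CPU time for the three "shorts" quoted above.
TASKS = {
    "hadcm3s_81bx_201412_120_564_011004032_1": (
        "3 days 22 hours 3 min 50 sec", "3 days 21 hours 22 min 23 sec"),
    "hadcm3s_82nj_201412_120_564_011005746_1": (
        "3 days 22 hours 44 min 36 sec", "3 days 22 hours 2 min 58 sec"),
    "hadcm3s_82lx_201412_120_564_011005688_1": (
        "3 days 18 hours 46 min 8 sec", "3 days 18 hours 3 min 37 sec"),
}

for name, (run_text, cpu_text) in TASKS.items():
    run_s, cpu_s = parse_duration(run_text), parse_duration(cpu_text)
    print(f"{name}: CPU/run = {cpu_s / run_s:.4f}")
```

For all three tasks the ratio comes out above 0.99, which is what "normal" means here.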
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59258 - Posted: 29 Dec 2018, 12:34:14 UTC - in response to Message 59235.  

And this IS research. Perhaps your computer is just slightly different in a way that will mean that it WON'T fail.


Perhaps, but it failed last night, with over 100 hours of CPU time (sorry I did not write it down). There must be several bugs, though, to lose the correct amount of CPU time -- not that this one matters much.

Name hadcm3s_st249_190012_120_771_011667216_2
Workunit 11667216

Run time 5 days 4 hours 15 min 25 sec
CPU time 39 sec

Application version UK Met Office HadCM3 short v8.34
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
Calling boinc_finish...04:22:04 (23029): called boinc_finish(22)
In boinc_exit called with status 22
Calloing set_signal_exit_code with status 22

</stderr_txt>
]]>
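
For anyone comparing many failed tasks, the stderr text can be summarised mechanically. The sketch below counts the "Model crashed" lines and pulls out the exit code from a copy of the output above; the regular expressions and the `summarise_stderr` name are assumptions based only on the lines shown in this post, so other model versions may word things differently.

```python
import re
from collections import Counter

# Copy of the stderr text shown above (blank lines omitted).
SAMPLE_STDERR = """\
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
Calling boinc_finish...04:22:04 (23029): called boinc_finish(22)
In boinc_exit called with status 22
"""

# The trailing "tmp/pipe_dummy" is dropped so identical reasons group together.
_CRASH = re.compile(
    r"^Model crashed:[ \t]*(.*?)(?:[ \t]+tmp/pipe_dummy)?[ \t]*$", re.MULTILINE
)
_EXIT = re.compile(r"called boinc_finish\((\d+)\)")


def summarise_stderr(text: str) -> dict:
    """Return the crash count, crash reasons and exit code found in stderr."""
    reasons = Counter(_CRASH.findall(text))
    exit_match = _EXIT.search(text)
    return {
        "crash_count": sum(reasons.values()),
        "reasons": reasons,
        "exit_code": int(exit_match.group(1)) if exit_match else None,
    }


summary = summarise_stderr(SAMPLE_STDERR)
print(summary["crash_count"], summary["exit_code"], dict(summary["reasons"]))
```

In this case the model was restarted and crashed six times with the same reason before the task gave up with exit code 22.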

Interesting that the other two got 3,111.26 credit and I got none.

Task      Computer  Sent                       Time reported or deadline  Status                 Run time (sec)  CPU time (sec)  Credit    Application
21453655  1256552   24 Dec 2018, 5:05:52 UTC   29 Dec 2018, 12:21:57 UTC  Error while computing  447,325.52      39.37           ---       UK Met Office HadCM3 short v8.34 i686-pc-linux-gnu
21389954  1425854   26 Nov 2018, 9:52:13 UTC   24 Dec 2018, 5:05:42 UTC   Error while computing  149,672.31      30.61           3,111.26  UK Met Office HadCM3 short v8.34 windows_intelx86
21366260  1468717   6 Nov 2018, 14:50:15 UTC   26 Nov 2018, 9:52:08 UTC   Error while computing  138,538.43      61.14           3,111.26  UK Met Office HadCM3 short v8.34 windows_intelx86
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59263 - Posted: 29 Dec 2018, 20:26:59 UTC

INVALID THETA DETECTED

That's an "unacceptable physics" error, so it looks like that set of starting values finally pushed things too far.

And now you've got me doing it :(
My last running model says it's been running for 2d 22h 53m.
But the Event log says it started 3d 1h 30m ago.
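
The size of that discrepancy is simple arithmetic, sketched here with Python's `timedelta`; the variable names are just for illustration.

```python
from datetime import timedelta

# Figures from the post above.
reported_run_time = timedelta(days=2, hours=22, minutes=53)
since_start_in_event_log = timedelta(days=3, hours=1, minutes=30)

# Wall-clock time that is not accounted for in the reported run time.
gap = since_start_in_event_log - reported_run_time
print(gap)  # 2:37:00
```

So about 2 hours 37 minutes of wall-clock time is missing from the reported run time.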

It's a batch 780, and as all my batch 781s have failed with a mismatch in some of the data files, (REPLANCA), I'm guessing this one will too in a couple of hours.
So, no more until some of the project people come back from wherever and find my emails.
Profilegeophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 59265 - Posted: 30 Dec 2018, 2:52:25 UTC - in response to Message 59258.  


Interesting that the other two got 3,111.26 credit and I got none.

But now you did. You just looked at it before the last weekly credit run on the server. :)
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59268 - Posted: 30 Dec 2018, 13:12:36 UTC

I just lost two more. After about 16 seconds.

Name hadcm3s_e239_191012_120_782_011725147_0
Workunit 11725147

Application version UK Met Office HadCM3 short v8.34
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
Sorry, too many model crashes! :-(
Calling boinc_finish...07:03:33 (4372): called boinc_finish(22)
In boinc_exit called with status 22
Calloing set_signal_exit_code with status 22

</stderr_txt>
]]>

30-Dec-2018 07:03:01 [climateprediction.net] Starting task hadcm3s_ze54_190012_120_782_011725889_0 using hadcm3s version 834 in slot 0

30-Dec-2018 07:03:19 [climateprediction.net] Computation for task hadcm3s_ze54_190012_120_782_011725889_0 finished
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_1.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_2.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_3.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_4.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_5.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_6.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_7.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_8.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_9.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_10.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_restart.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
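
The same information can be pulled out of the client log with a short script. This sketch works out how long the task ran and how many output files were reported absent, using four of the log lines above with the wrapped lines joined; the patterns are assumptions based only on the wording of these particular lines.

```python
import re
from datetime import datetime

SAMPLE_LOG = """\
30-Dec-2018 07:03:01 [climateprediction.net] Starting task hadcm3s_ze54_190012_120_782_011725889_0 using hadcm3s version 834 in slot 0
30-Dec-2018 07:03:19 [climateprediction.net] Computation for task hadcm3s_ze54_190012_120_782_011725889_0 finished
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_1.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_restart.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
"""

_LINE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (.*)$")
_STARTED = re.compile(r"^Starting task (\S+) ")
_FINISHED = re.compile(r"^Computation for task (\S+) finished$")
_ABSENT = re.compile(r"^Output file (\S+) for task (\S+) absent$")


def summarise(log_text: str) -> dict:
    """Collect start time, finish time and absent output files per task."""
    tasks = {}

    def entry(name):
        return tasks.setdefault(name, {"started": None, "finished": None, "absent": []})

    for raw in log_text.splitlines():
        line = _LINE.match(raw.strip())
        if not line:
            continue
        # %b expects English month abbreviations (the default C locale).
        when = datetime.strptime(line.group(1), "%d-%b-%Y %H:%M:%S")
        message = line.group(2)
        started = _STARTED.match(message)
        finished = _FINISHED.match(message)
        absent = _ABSENT.match(message)
        if started:
            entry(started.group(1))["started"] = when
        elif finished:
            entry(finished.group(1))["finished"] = when
        elif absent:
            entry(absent.group(2))["absent"].append(absent.group(1))
    return tasks


def elapsed_seconds(info: dict) -> float:
    """Wall-clock seconds between the start and finish lines of one task."""
    return (info["finished"] - info["started"]).total_seconds()


for name, info in summarise(SAMPLE_LOG).items():
    print(name, elapsed_seconds(info), len(info["absent"]))
```

For this task the log shows 18 seconds between start and finish, so it is no surprise that none of the output zips existed yet.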
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59269 - Posted: 30 Dec 2018, 13:54:07 UTC - in response to Message 59268.  

I just lost two more. After about 16 seconds.


And two more ...
ProfileIain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 59270 - Posted: 30 Dec 2018, 16:53:52 UTC - in response to Message 59269.  

I just lost two more. After about 16 seconds.


And two more ...

As far as I can see they all fail with that error, which is new to me:

buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59274 - Posted: 30 Dec 2018, 19:24:01 UTC - in response to Message 59270.  

As far as I can see they all fail with that error, which is new to me:

buffout error in ASWAP! -- 17116840

Model crashed: TRANSOUT: I/O write error


New to me too. I looked at the boinc_client.log and included part of it, above, about one of the failed work units for today. It seemed to say nothing other than that the files to be uploaded did not exist. Well, in 16 seconds, I would not imagine they had been created yet.

I have no idea what ASWAP means. If the Linux kernel wants to page out some idle pages, it is free to do so, and that should not bother the application. Perhaps this message is not about swapping at all. Is 17116840 the number of bytes it wants to write or read?

What is TRANSOUT about? That does look as though a read or write had a problem. My machine seems to be reading and writing OK, though.
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 59379 - Posted: 10 Jan 2019, 17:36:05 UTC

The w/u ran to 100% and then gave "computing error" with msg of Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH

The job had 7 trickles waiting to upload.... when it reported the end of job the trickles were aborted (They disappeared anyway).

So I guess it's a loss all round. I don't seem to do too well with Climate w/u, with almost a 50% fail rate.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59385 - Posted: 10 Jan 2019, 22:22:47 UTC
Last modified: 10 Jan 2019, 22:25:35 UTC

REPLANCA is just what the rest of that line says.

Example:
You have 2 files, one contains people's names, the other how much you pay them.
The 1st file has 10 items, the 2nd file has 8.

So when you get to item 9 in the 1st list, there's no data in the 2nd list.
But where in the 2nd list were the amounts missed out?

And it's LOTS more complicated with these climate models.

So not your fault.
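
The two-file analogy can be sketched in a few lines. This only illustrates the analogy; it is not how the model's REPLANCA routine works, and the names and amounts are made up.

```python
from itertools import zip_longest

# Hypothetical data: 10 names in the first file, only 8 amounts in the second.
names = [f"person_{i}" for i in range(1, 11)]
amounts = [100, 120, 95, 110, 130, 105, 115, 125]

_MISSING = object()  # sentinel marking "no counterpart in the other list"


def first_unmatched(first, second):
    """Return the 1-based position where one list has no partner, else None."""
    pairs = zip_longest(first, second, fillvalue=_MISSING)
    for position, (a, b) in enumerate(pairs, start=1):
        if a is _MISSING or b is _MISSING:
            return position
    return None


position = first_unmatched(names, amounts)
if position is not None:
    print(f"Lists stop matching at item {position}")
```

The check can only say that the lists run out of step at item 9. It cannot say which two amounts were left out, so every pairing before that point is suspect as well.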

******************************

As for the other problem, the files disappearing, that's just how BOINC works.
It's really designed for other projects, so when it gets the signal that the task has failed, the next item on its To Do list is to send back the error messages.
Oh, and we don't need these other files any more, so remove them from the list.
:(

***********************

And please read my post here about the difference between Trickles and zips.

Your trickles (on which credit is based) would have been returned, as per a few lines in the Event log. It's the zips, in the Transfers tab, that disappeared.
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 59386 - Posted: 10 Jan 2019, 23:45:22 UTC

Thanks for the info. So the zip files are the science bit. So if a w/u fails at some point and the zip files are still waiting to be uploaded, then the science is lost also?
Are partially completed w/u still of value to the project? It's frustrating seeing 5 days of processing going to waste.... I will give it another go when the zip upload issues go away.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59387 - Posted: 11 Jan 2019, 2:42:49 UTC

Yes, the science is lost.
Those models are a bit like buying an apple with half of it rotten.
Best to buy a good one.

Or, in this case, dump the incomplete models and only use the fully completed AND RECEIVED models.
And, if necessary, issue another batch to cover the gaps in the results.

But with the REPLANCA fails, the models will be incomplete anyway.
Harri Liljeroos

Joined: 9 Dec 05
Posts: 116
Credit: 12,547,934
RAC: 2,738
Message 59415 - Posted: 12 Jan 2019, 22:08:44 UTC

I just got an error with wah2 global model after 8 days and 13 hours of calculation. The error was 196 exit_disk_limit_exceeded. At least 50 zip files that were waiting to be uploaded were aborted. The task is here: https://www.cpdn.org/cpdnboinc/result.php?resultid=21361420
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59417 - Posted: 12 Jan 2019, 23:34:50 UTC - in response to Message 59415.  

I just got an error with wah2 global model after 8 days and 13 hours of calculation. The error was 196 exit_disk_limit_exceeded. At least 50 zip files that were waiting to be uploaded were aborted. The task is here: https://www.cpdn.org/cpdnboinc/result.php?resultid=21361420


Which is what I said might happen here.

©2024 cpdn.org