climateprediction.net (CPDN) home page
Thread 'So what the hell caused Error -161?'

old_user2781

Joined: 29 Aug 04
Posts: 6
Credit: 49,394
RAC: 0
Message 18547 - Posted: 21 Dec 2005, 12:24:15 UTC

So after 44 hours of my first ever sulphur model I get error -161? Any particular reason why this should occur?


21/12/2005 05:21:32|climateprediction.net|Unrecoverable error for result sulphur_dp30_000639036_0 (<file_xfer_error> <file_name>sulphur_dp30_000639036_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_dp30_000639036_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_dp30_000639036_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_dp30_000639036_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_dp30_000639036_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
21/12/2005 05:21:36|climateprediction.net|Deferring communication with project for 56 seconds
old_user56785
Joined: 23 Feb 05
Posts: 55
Credit: 240,119
RAC: 0
Message 18549 - Posted: 21 Dec 2005, 13:14:28 UTC - in response to Message 18547.  

Off the top of my head, I would say it has to do with 'file transfer'.

It looks like your model crashed for some reason. It tried to send the files but reported a failure in doing so.

Either the files were not generated after the crash, or they were corrupt.

Count yourself lucky that this did not happen 44 hours before completion!
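
For anyone who wants to see at a glance which upload files were affected, here is a minimal Python sketch that pulls the file names and error codes out of a log line like the one quoted in the first post. The function name `list_transfer_errors` and the shortened `LOG_LINE` (two of the five entries) are my own, for illustration only. It does not interpret what -161 means; it only lists what the log already says.

```python
import re

# Shortened copy of the log line from the first post (two of the five entries).
LOG_LINE = (
    "21/12/2005 05:21:32|climateprediction.net|Unrecoverable error for result "
    "sulphur_dp30_000639036_0 ("
    "<file_xfer_error> <file_name>sulphur_dp30_000639036_0_1.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message></file_xfer_error>"
    "<file_xfer_error> <file_name>sulphur_dp30_000639036_0_2.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message></file_xfer_error>)"
)

# One match per <file_xfer_error> entry: the file name and its numeric code.
ENTRY = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def list_transfer_errors(line):
    """Return (file_name, error_code) pairs found in one message-log line."""
    return [(m.group("name"), int(m.group("code"))) for m in ENTRY.finditer(line)]


errors = list_transfer_errors(LOG_LINE)
for name, code in errors:
    print(f"{name}: error {code}")
```

If every output file of a result shows the same code, as here, that fits the idea that the files were never produced rather than a single upload going wrong.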
old_user3434
Joined: 30 Aug 04
Posts: 77
Credit: 1,785,934
RAC: 0
Message 18568 - Posted: 21 Dec 2005, 18:12:09 UTC - in response to Message 18549.  
Last modified: 21 Dec 2005, 18:16:34 UTC

I have also already lost a number of Sulphur models (using Linux BOINC V5.2.13).

3x from one Host

process got signal 11
<stderr_txt>
free(): invalid pointer 0xbffff8d8!

= sigsegv - segmentation violation

It happened twice, after 3,597.02s and 3,595.87s respectively, which is almost exactly at the first 60-minute project cycle.
The 3rd time it happened after 32,511.49s, which is again close to the 9th cycle for CPDN in the multi-project environment.

The 4th time (same host) it just stated "process got signal 11" (no other stderr details) after 360,344.25s, which is (again) at cycle time, the 100h cycle in this case.

Another host had one with "process got signal 11" as well, after 35,364.90s (closing in on the 10h cycle).

Since the times are almost always close to a multiple of the 1h project cycle, I'll have a close look at the systems. Both have run other projects flawlessly so far, and I have never had specific problems with them.

It's extremely annoying to see those long-running models crash out, taking a considerable amount of CPU time down with them :(
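
To make the cycle observation easy to check, here is a small Python sketch that measures how far each reported crash time sits from the nearest multiple of 3,600s. It assumes the reported times can be compared directly with the 1h switch interval; the names `CYCLE_S` and `offset_from_cycle` are made up for this example.

```python
CYCLE_S = 3600.0  # the 1 h project cycle mentioned above (assumed switch interval)

# Times at which the five crashes were reported, in seconds.
crash_times = [3597.02, 3595.87, 32511.49, 360344.25, 35364.90]


def offset_from_cycle(t, cycle=CYCLE_S):
    """Return (nearest cycle count, signed offset in seconds from that boundary)."""
    n = round(t / cycle)
    return n, t - n * cycle


offsets = [offset_from_cycle(t) for t in crash_times]
for t, (n, off) in zip(crash_times, offsets):
    print(f"{t:>10.2f} s  nearest boundary {n:>3} h  offset {off:+8.2f} s")
```

The first two crashes land within a few seconds of the 1h boundary. The others are a couple of minutes to roughly ten minutes away from the 9h, 100h and 10h boundaries, so the pattern is suggestive rather than exact.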
Scientific Network : 44800 MHz - 77824 MB - 1970 GB
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 18570 - Posted: 21 Dec 2005, 18:25:46 UTC

There is some hope for better times in Dave Frame's post here: http://www.climateprediction.net/board/viewtopic.php?t=3412&start=15

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 18571 - Posted: 21 Dec 2005, 18:45:49 UTC

FalconFly,

Did these signal 11 crashes occur either
1. near a benchmark
or
2. as the climateprediction WU was preempted for another project?

If so, do you have "Leave application in memory when pre-empted" set to Yes or No?

If it is set to No, try Yes and see if that helps.
old_user3434
Joined: 30 Aug 04
Posts: 77
Credit: 1,785,934
RAC: 0
Message 18602 - Posted: 22 Dec 2005, 5:52:42 UTC - in response to Message 18571.  

I think they all happened while being preempted.

Due to bad experiences in the past with the "Keep Applications in Memory" setting (projects continuing to run despite being paused), it is currently disabled.

I'll go ahead and enable it for testing, though; maybe it works now.
(I lost another two models overnight, again on different systems.)
old_user3434
Joined: 30 Aug 04
Posts: 77
Credit: 1,785,934
RAC: 0
Message 18985 - Posted: 4 Jan 2006, 10:31:15 UTC - in response to Message 18602.  

It really did the trick: I haven't had any such errors since changing that setting :)