climateprediction.net (CPDN) home page
Thread 'sulphur model - Linux - Signal 11'

Message boards : Number crunching : sulphur model - Linux - Signal 11

old_user52163
Joined: 4 Feb 05
Posts: 10
Credit: 779,835
RAC: 0
Message 17908 - Posted: 8 Dec 2005, 18:44:41 UTC

I decided to attach one of my Linux systems to CP and all went well, 'til BOINC did a task switch.

[ Log clip from BoincView ]

Location Host Project Date ID Message
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1364 Requesting 34560 seconds of new work, and reporting 1 results
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1363 Reason: To fetch work
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1362 Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:33 PM 1361 Computation for result sulphur_eq2e_000686966_0 finished
blnt7 blnt7 blnt7 --- 12/8/2005 12:09:33 PM 1360 request_reschedule_cpus: process exited
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:33 PM 1359 Unrecoverable error for result sulphur_eq2e_000686966_0 (process got signal 11)
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:30 PM 1358 Pausing result sulphur_eq2e_000686966_0 (removed from memory)
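For anyone wanting to count these failures across a longer message log, the "process got signal" lines in a clip like the one above can be picked out with a short script. This is only a sketch: it assumes the messages have exactly the wording shown in the clip, and the function and variable names are made up for illustration.

```python
import re

# Matches client messages worded like the one in the log clip:
#   "Unrecoverable error for result <name> (process got signal <n>)"
# The wording is taken from the clip; other client versions may differ.
SIGNAL_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(process got signal (?P<signal>\d+)\)"
)


def find_signal_failures(log_text):
    """Return a list of (result_name, signal_number) found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = SIGNAL_RE.search(line)
        if match:
            failures.append((match.group("result"), int(match.group("signal"))))
    return failures


# Two lines copied from the log clip, used here as sample input.
SAMPLE = (
    "blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:33 PM 1359 "
    "Unrecoverable error for result sulphur_eq2e_000686966_0 (process got signal 11)\n"
    "blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:30 PM 1358 "
    "Pausing result sulphur_eq2e_000686966_0 (removed from memory)\n"
)

if __name__ == "__main__":
    for result, signal in find_signal_failures(SAMPLE):
        print(f"{result}: signal {signal}")
```

Signal 11 is SIGSEGV (segmentation violation) on Linux, so filtering on the signal number separates these crashes from other kinds of exit.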

I've detached this system from CP 'til this problem can be resolved, as I don't want to 'waste' CPU cycles.

:-) :-)

I have the preferences set to 'remove from memory' because, on another Linux box, the task wouldn't continue to run when it was task switched in.

Is this a problem with Linux, the Linux api, or with the sulphur model?


ID: 17908
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 17909 - Posted: 8 Dec 2005, 18:52:05 UTC

The only time I've had this problem was when I had the preference set to "remove from memory when preempted". Since 5.2.x, with the preference "Leave applications in memory when preempted" set to yes, I've had no signal 11 errors. As for which application is at fault, I have no idea.
ID: 17909
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 17925 - Posted: 9 Dec 2005, 7:40:30 UTC
Last modified: 9 Dec 2005, 7:40:47 UTC

CJOrtega wrote:
Is this a problem with Linux, the Linux api, or with the sulphur model?


I guess it's something with the sulphur code. I tried running only sulphur on one of my machines, but unfortunately that didn't work either. Not everyone can leave all applications in memory, so I think the developers should look at the code to find what's causing the segmentation violation (signal 11).

I'm still waiting for a response from a developer, but none has replied to my message here.
ID: 17925
Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 17950 - Posted: 9 Dec 2005, 21:36:34 UTC
Last modified: 9 Dec 2005, 21:40:03 UTC

You don't happen to have a print-out from a command line?
I've had 4 of these lately:

CPDN Monitor got quit request...
Cleaning up graphics data...
Closing graphics shared object file...
Detaching shared memory...
Cleaning up graphics data...

One recovered itself but the other three resulted in signal 11 errors.
That was on BOINC 5.2.7-5.2.8; it seems better now with BOINC 5.2.13.
I don't know...
ID: 17950
old_user52163
Joined: 4 Feb 05
Posts: 10
Credit: 779,835
RAC: 0
Message 17955 - Posted: 9 Dec 2005, 22:53:40 UTC
Last modified: 9 Dec 2005, 23:03:21 UTC

A while back [on host blnt5] I had a problem with the CP task not running when its run time came up. So I set the prefs to 'not keep in memory', which 'fixed' this problem. I was at BOINC V4.x at that time. I am now at BOINC V5.2.8.

Following geophi's response, I changed the prefs to 'keep in memory', and monitored things [on host blnt5] for a few task switch cycles. I didn't see any repeat of the 'not running' problem.

So I crossed my fingers, re-attached host blnt7 to CPDN, and have been monitoring things for the past several hours. I haven't seen any sign of the 'not running' problem on either host and, so far, no sign of a 'sig 11'.

I will post to this thread if I see a repeat of the failure.

[edit] P.S. Both hosts are running MDK V10.2 LE Linux.

ID: 17955
old_user52163
Joined: 4 Feb 05
Posts: 10
Credit: 779,835
RAC: 0
Message 18065 - Posted: 11 Dec 2005, 19:28:48 UTC

After 24 hrs processing of a sulphur model, host blnt7 again aborted with a signal 11.

I've detached that host from CPDN.

My other Linux host is still doing a slab model, and I will watch it when it finishes, as it will probably get a sulphur model to chew on.


ID: 18065
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 18083 - Posted: 12 Dec 2005, 3:47:26 UTC

I, too, have something to report. I switched 'Leave Apps in memory' on just for CPDN. This morning, just a few minutes ago, another CPDN WU crashed when I started my PC. So, please, look into the program. There's a bug in there.
ID: 18083
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 18085 - Posted: 12 Dec 2005, 4:24:41 UTC - in response to Message 18083.  

I, too, have something to report. I switched 'Leave Apps in memory' on just for CPDN. This morning, just a few minutes ago, another CPDN WU crashed when I started my PC. So, please, look into the program. There's a bug in there.

Andre,
Which ResultID crashed?
ID: 18085
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 18109 - Posted: 12 Dec 2005, 19:11:32 UTC - in response to Message 18085.  


Andre,
Which ResultID crashed?

I think it was this one. It's the right computer and the time is right, too.

I would really appreciate it if a developer could look into this matter, as it seems there's something seriously wrong with sulphur 4.22 for Linux :-/
ID: 18109
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 18111 - Posted: 12 Dec 2005, 20:35:22 UTC - in response to Message 18109.  


Andre,
Which ResultID crashed?

I think it was this one. It's the right computer and the time is right, too.

I would really appreciate it if a developer could look into this matter, as it seems there's something seriously wrong with sulphur 4.22 for Linux :-/

The ID that you linked to crashed on the 9th with a signal 11. That must have been when the preference "leave applications in memory when pre-empted" was set to "no", as that is the error that often occurs on a task switch or benchmark with that preference setting.

I've sent an e-mail to Tolu about the signal 11 errors with links to the forum threads on it. It may be a tough one to track down though.
ID: 18111
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 18115 - Posted: 12 Dec 2005, 21:53:58 UTC - in response to Message 18111.  


The ID that you linked to crashed on the 9th with a signal 11. That must have been when the preference "leave applications in memory when pre-empted" was set to "no", as that is the error that often occurs on a task switch or benchmark with that preference setting.


I think you're wrong on this point: I *got* this WU on 9 December. It was received (as in reported) today (12 Dec). And it's the most recent 'completed' WU from this computer, so I stand by this result.


I've sent an e-mail to Tolu about the signal 11 errors with links to the forum threads on it. It may be a tough one to track down though.


It's certainly necessary to get this bug out of the programs. It would be really bad if a WU crashed sometime into the 3rd or 4th phase. That wouldn't be that easy to restore. I just hope the devs find out what's causing this bug. If they need some of the crashed directories, I have at least that of the WU I mentioned already.

But now I'd better get some sleep ;-)
ID: 18115
old_user6033
Joined: 31 Aug 04
Posts: 7
Credit: 58,875,069
RAC: 0
Message 18208 - Posted: 14 Dec 2005, 20:33:04 UTC

Yep - we're also getting quite a few of these sig 11s,
e.g. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1351506

As per this thread, just upgraded from 5.2.4 to 5.2.13, so we'll see whether that makes a difference.

"Leave applications in memory when preempted" has always been set to no, as the jobs are either run or not and wouldn't ever get suspended. That setting seems to have been fine for many slab models, so it does point to the sulphur model... I'll also see whether changing that makes a difference.

Seb




ID: 18208
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 18578 - Posted: 21 Dec 2005, 20:18:39 UTC
Last modified: 21 Dec 2005, 20:19:13 UTC

It's still happening on one of my Linux boxes. Any news from the staff about this problem? It's really frustrating to see another WU crash when I'm booting the PC...

[edit: Tpyo]
ID: 18578
old_user60427
Joined: 4 Mar 05
Posts: 24
Credit: 243,647
RAC: 0
Message 19098 - Posted: 9 Jan 2006, 20:59:37 UTC

Same issue here. Running Mandriva 2006.0 Linux. My first sulphur model errored out with sig 11, just after reaching the first trickle (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1578019). To add insult to injury, I did not get credit for the first trickle either. Is this because the model crashed or because it errored out?

I have "Leave applications in memory" set to no because, with it set to yes, BOINC would not stop setiathome when switching to CPDN. The BOINC client is 5.2.14 (5.2.13 optimised by crunch3r). I haven't had this problem with slab models. I now have a new sulphur model and we'll see what that gives.
ID: 19098
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19100 - Posted: 9 Jan 2006, 21:21:15 UTC

In the BOINC Questions and Problems Linux forum, Tolu announced sulphur 4.23, which is supposed to have solved the signal 11 problem. Any models downloaded after late in the GMT day on January 4th should have the new app (you should see a 423 in the work tab of the BOINC GUI).

As for the credit, with the server being down from the 6th until today, it is likely the stats scripts haven't run since you had the one trickle. They are only being run once a day now, and should update before 12 GMT 10 January.
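The "423" shown in the work tab corresponds to app version 4.23. A minimal sketch of that mapping follows, assuming the version is stored as a single integer of the form major × 100 + minor; the function names are hypothetical and not part of BOINC.

```python
def format_app_version(version_num):
    """Turn an integer such as 423 into a dotted string such as '4.23'.

    Assumes the encoding major * 100 + minor.
    """
    major, minor = divmod(version_num, 100)
    return f"{major}.{minor:02d}"


def has_signal11_fix(version_num, fixed_version=423):
    """True if the app is at or past the version announced as fixed (4.23)."""
    return version_num >= fixed_version


if __name__ == "__main__":
    for num in (422, 423):
        print(format_app_version(num), has_signal11_fix(num))
```

A model still showing 422 would be running the older sulphur 4.22 app discussed earlier in the thread.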
ID: 19100
old_user2354
Joined: 28 Aug 04
Posts: 13
Credit: 767,708
RAC: 0
Message 19137 - Posted: 10 Jan 2006, 11:43:04 UTC - in response to Message 19100.  

In the BOINC Questions and Problems Linux forum, Tolu announced sulphur 4.23, which is supposed to have solved the signal 11 problem. Any models downloaded after late in the GMT day on January 4th should have the new app (you should see a 423 in the work tab of the BOINC GUI).

As for the credit, with the server being down from the 6th until today, it is likely the stats scripts haven't run since you had the one trickle. They are only being run once a day now, and should update before 12 GMT 10 January.


Now I'm waiting for my first 4.23 WU to be requested. Then I'll switch both PCs (and their one remaining model each) manually, as Honza described. Unless, of course, someone has already posted a how-to?
ID: 19137