Message boards : Number crunching : sulphur model - Linux - Signal 11
Author | Message |
---|---|
Send message Joined: 4 Feb 05 Posts: 10 Credit: 779,835 RAC: 0 |
I decided to attach one of my Linux systems to CP and all went well, 'til BOINC did a task switch.
[ Log clip from BoincView ]
Location Host Project Date ID Message
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1364 Requesting 34560 seconds of new work, and reporting 1 results
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1363 Reason: To fetch work
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:10:36 PM 1362 Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:33 PM 1361 Computation for result sulphur_eq2e_000686966_0 finished
blnt7 blnt7 blnt7 --- 12/8/2005 12:09:33 PM 1360 request_reschedule_cpus: process exited
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:33 PM 1359 Unrecoverable error for result sulphur_eq2e_000686966_0 (process got signal 11)
blnt7 blnt7 blnt7 climateprediction.net 12/8/2005 12:09:30 PM 1358 Pausing result sulphur_eq2e_000686966_0 (removed from memory)
I've detached this system from CP 'til this problem can be resolved, as I don't want to 'waste' cpu cycles. :-) :-) I have the preferences set to 'remove from memory' because, on another Linux box, the task wouldn't continue to run when it was task switched in. Is this a problem with Linux, the Linux api, or with the sulphur model? |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The only time I've had this problem was when I had the preference set to "remove from memory when preempted". Since 5.2.x, and with the preference "Leave applications in memory when preempted" set to yes, I've had no signal 11 errors. As for which application is at fault, I have no idea. |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
CJOrtega wrote: Is this a problem with Linux, the Linux api, or with the sulphur model? I guess it's something in the sulphur code. I tried running only sulphur on one of my machines, but unfortunately that didn't work either. Not everyone can leave all applications in memory, so I think the developers should look at what in the code is causing a segmentation violation (signal 11). I'm still waiting for a response from a developer, but none has replied to my message here. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
You don't happen to have a print-out from a command line? I've had 4 of these lately: CPDN Monitor got quit request... Cleaning up graphics data... Closing graphics shared object file... Detaching shared memory... Cleaning up graphics data... One recovered itself but the other three resulted in signal 11 errors. That was on boinc 5.2.7-5.2.8, seems better now with boinc 5.2.13. I don't know... |
Send message Joined: 4 Feb 05 Posts: 10 Credit: 779,835 RAC: 0 |
A while back, [ on host blnt5 ] I had a problem with the cp task not running when its run time came up. So I set the prefs to 'not keep in memory', which 'fixed' this problem. I was at BOINC V4.x at that time. I am now at BOINC V5.2.8. Following geophi's response, I changed the prefs to 'keep in memory', and monitored things [ on host blnt5 ] for a few task switch cycles. I didn't see any repeat of the 'not running' problem. So I crossed my fingers, and re-attached host blnt7 to CPN, and have been monitoring things for the past several hours. I haven't seen any sign of the 'not running' problem on either host, and, so far, no sign of a 'sig 11'. I will post to this thread if I see a repeat of the failure. [edit] P.S. Both hosts are running MKD V10.2 LE Linux. |
Send message Joined: 4 Feb 05 Posts: 10 Credit: 779,835 RAC: 0 |
After 24 hrs processing of a sulphur model, host blnt7 again aborted with a signal 11. I've detached that host from CPN. My other Linux host is still doing a slab model, and I will watch it when it finishes, as it probably will get a sulphur model to chew on. |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
I, too, have something to report. I switched 'Leave Apps in memory' on just for CPDN. This morning, just a few minutes ago, another CPDN WU crashed when I started my PC. So, please, look into the program. There's a bug in there. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I, too, have something to report. I switched 'Leave Apps in memory' on just for CPDN. This morning, just a few minutes ago, another CPDN WU crashed when I started my PC. So, please, look into the program. There's a bug in there. Andre, which ResultID crashed? |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
I think it was this one. It's the right computer and the time is right, too. I would really appreciate it if a developer could look into this matter, as it seems there's something seriously wrong with sulphur 4.22 for Linux :-/ |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The ID that you linked to crashed on the 9th with a signal 11. That must have been when the preference for "leave applications in memory when pre-empted" was set to "no", as that is the error that often occurs with a task switch or benchmark with that preference setting. I've sent an e-mail to Tolu about the signal 11 errors with links to the forum threads on it. It may be a tough one to track down though. |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
I think you're wrong on this point: I *got* this WU on 9 December. It was received (as in reported) today (12 Dec). And it's the most recent 'completed' WU from this computer, so I'm sticking with this result.
It's certainly necessary to get this bug out of the program. It would be really bad if a WU crashed sometime into the 3rd or 4th phase. That wouldn't be that easy to restore. I just hope the devs find out what's causing this bug. If they need some of the crashed directories, I have at least that of the WU I mentioned already. But now, I'd better get some sleep ;-) |
Send message Joined: 31 Aug 04 Posts: 7 Credit: 58,875,069 RAC: 0 |
Yep - we're also getting quite a few of these sig 11s, e.g. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1351506 As per this thread, just upgraded from 5.2.4 to 5.2.13, so we'll see whether that makes a difference. "Leave applications in memory when preempted" has always been set to no, as the jobs are either run or not and wouldn't ever get suspended. That setting seems to have been fine for many slab models, so it does point to the sulphur model... I'll also see whether that makes a difference. Seb |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
|
Send message Joined: 4 Mar 05 Posts: 24 Credit: 243,647 RAC: 0 |
Same issue here. Running Mandriva 2006.0 Linux. My first sulphur model errored out with sig 11, just after reaching the first trickle (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1578019). To add insult to injury, I did not get credit for the first trickle either. Is this because the model crashed or because it errored out? I have "Leave applications in memory" set to no, because with it set to yes, boinc would not stop setiathome when switching to CPDN. BOINC Client is 5.2.14 (5.2.13 optimised by crunch3r). Haven't had this problem with slab models. I now got a new sulphur and we'll see what that gives. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
In the BOINC Questions and Problems Linux forum, Tolu announced sulphur 4.23 which is supposed to have solved the signal 11 problem. Any models downloaded after late in the GMT day January 4th should have the new app (should see a 423 in the work tab of the BOINC GUI). As for the credit, with the server being down from the 6th until today, it is likely the stats scripts weren't run since you had the one trickle. It is only being run once a day now, and should update before 12 GMT 10 January. |
Send message Joined: 28 Aug 04 Posts: 13 Credit: 767,708 RAC: 0 |
In the BOINC Questions and Problems Linux forum, Tolu announced sulphur 4.23 which is supposed to have solved the signal 11 problem. Any models downloaded after late in the GMT day January 4th should have the new app (should see a 423 in the work tab of the BOINC GUI). Now, I'm waiting for my first 4.23 WU to be requested. Then I'll switch both PCs (and their one remaining model each) manually like Honza described. Unless, of course, someone has posted a How-To already? |
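
For anyone who wants to check their own saved message logs for this failure, here is a minimal sketch that pulls out results which died on a signal. It assumes the message text looks like the line in the log clip from the first post ("Unrecoverable error for result ... (process got signal 11)"); other BOINC versions may word it differently. The function name `find_signal_errors` and the `sample` text are made up for illustration and are not part of BOINC.

```python
import re

# Assumption: the error message has the same wording as in the log clip
# quoted in the first post of this thread.
SIGNAL_RE = re.compile(
    r"Unrecoverable error for result (\S+) \(process got signal (\d+)\)"
)


def find_signal_errors(log_text):
    """Return (result_name, signal_number) pairs found in log_text."""
    return [(name, int(sig)) for name, sig in SIGNAL_RE.findall(log_text)]


# Two message lines taken from the log clip above (date, ID, message).
sample = (
    "12/8/2005 12:09:33 PM 1359 Unrecoverable error for result "
    "sulphur_eq2e_000686966_0 (process got signal 11)\n"
    "12/8/2005 12:09:30 PM 1358 Pausing result "
    "sulphur_eq2e_000686966_0 (removed from memory)\n"
)

for result_name, signal_number in find_signal_errors(sample):
    print(result_name, "exited on signal", signal_number)
```

To run it against a real log, read the file contents into a string and pass that string to `find_signal_errors` in place of `sample`.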
©2024 cpdn.org