Questions and Answers : Unix/Linux : new app. 4.23 resolves signal 11 bug
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 173 Credit: 1,843,046 RAC: 0 |
The new app version 4.23 resolves the reported signal 11 error. It also brings a speed improvement on Intel processors. |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Great news and congratulations to the software development team! However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date) and the "model won't start" bug has not yet been addressed. This bug seems to affect mostly Red Hat distributions. It is very deterministic. On some machines, the sulphur model simply won't start. It hangs in a kind of deadlock state, without any CPU consumption. I have tested this bug on 7 machines with different kernel versions and performance levels. Only 1 machine does not suffer from this strange bug, but I have not yet been able to determine what is different about that machine (there was another machine with the same kernel version which did not work, for example). This shows, however, that this IS a very serious bug, rendering the application completely unusable on many Linux machines. Cheers, Stefan. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date) and the "model won't start" bug has not yet been addressed. Could you give us a link to the other threads that may have talked about this, or at least more documentation of the error? I've had no problem starting sulphur on 3 different Fedora Core installations (1 FC3 and 2 FC4). |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date) and the "model won't start" bug has not yet been addressed. Well, I think this thread is related to it: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=3737 Yes, I would gladly help out and give you as many details as possible about the machines. I am only a beginner on Linux, so if you could tell me any particular tests (for example, certain commands) I should run on the machines, I would gladly run them and give you their outputs. Up to this point, all I know is how to query the kernel version with the "uname -a" command. (I think a real Linux guru will have a good laugh when he reads this.) The machines I am testing on are part of my University's network (apart from my own machine, I will not run the project on those machines since they don't belong to me, but I can use them to test and try to figure out the problem). Here are some stats I have: Machine name: glenora Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.00GHz Memory: 500.84 MB physical, 1.00 GB virtual uname output: Linux glenora.cs 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown Machine name: shading Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.60GHz Memory: 1.97 GB physical, 1.00 GB virtual uname output: Linux shading.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown Machine name: shape Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.60GHz Memory: 1.97 GB physical, 1.00 GB virtual uname output: Linux shape.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown Machine name: mlp1 Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.53GHz Memory: 501.61 MB physical, 1.00 GB virtual uname output: Linux mlp1.ai 2.4.20-30.7.legacy #1 Fri Feb 20 10:46:44 PST 2004 i686 unknown Machine name: bauda Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 1.90GHz Memory: 500.84 MB physical, 1.00 GB virtual uname output: Linux bauda.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown Notes: * Out of all these machines, "shading" is the only one that works. * Up to this point, I've been running HADSM models and it all worked fine. Only when I switched to sulphur did the problems appear. Please tell me what more information you require, and I will do my best to get it for you. Warm regards, Stefan. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I'm no Linux guru either. Try this on shading and shape, since they look so similar and one works and the other doesn't: ulimit -a and paste the output from those two PCs in your reply. This is what I got on one of my PCs that has had no problems with starting sulphur (P4 3.4 GHz, 512 MB RAM, Mandrake 10.1): root@localhost root# ulimit -a core file size (blocks, -c) 1000000 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4095 virtual memory (kbytes, -v) unlimited |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
I'm no Linux guru either. Try this on shading and shape, since they look so similar and one works and the other doesn't: Ok, here they are: shape: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 7168 virtual memory (kbytes, -v) unlimited shading: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 7168 virtual memory (kbytes, -v) unlimited They seem identical :(. Yet I re-checked just now (for the 3rd time), in order to make sure that I don't mislead you: indeed, the model gets deadlocked on SHAPE and works fine on SHADING. I keep wondering, then, what is different between these two computers? The SAME executable works fine on one and gets deadlocked on the other... Strange... Well, Geophi, please tell me what other tests we might carry out next. Cheers, Stefan. |
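Comparing two `ulimit -a` listings by eye is easy to get wrong, so here is a minimal Python sketch that parses two pasted listings and reports only the limits that differ. The helper names `parse_ulimit` and `diff_limits` are hypothetical (they are not part of BOINC or any tool mentioned in this thread), and the parser assumes the value is the last whitespace-separated token on each line, as in the listings posted here.

```python
def parse_ulimit(text):
    """Map each limit name, e.g. 'open files (-n)', to its value string."""
    limits = {}
    for line in text.strip().splitlines():
        # Assumption: the value is the last whitespace-separated token.
        name, _, value = line.strip().rpartition(" ")
        # Collapse runs of spaces so column padding does not matter.
        limits[" ".join(name.split())] = value
    return limits


def diff_limits(a, b):
    """Return {name: (value_a, value_b)} for every limit that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


if __name__ == "__main__":
    shape = "core file size (blocks, -c) 0\nmax user processes (-u) 7168"
    other = "core file size (blocks, -c) 1000000\nmax user processes (-u) 4095"
    for name, (left, right) in diff_limits(parse_ulimit(shape), parse_ulimit(other)).items():
        print(f"{name}: {left} != {right}")
```

An empty result for two machines, as with SHAPE and SHADING here, suggests that resource limits are not what separates them.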
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Indeed, the model gets deadlocked on SHAPE and works fine on SHADING. My lack of guruness is showing. I'm afraid at this point we will have to wait for a real guru. |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Hi, Geophi. I don\'t know if it is of any help, but just in case... Here are also two process snapshots (ps -Alw), while running sulphur. The first one is taken on SHAPE, the second on SHADING. F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 100 S 0 1 0 0 68 0 - 331 do_sel ? 00:00:04 init 040 S 0 2 1 0 69 0 - 0 contex ? 00:00:00 keventd 040 S 0 3 1 0 69 0 - 0 apm_ma ? 00:00:00 kapmd 040 S 0 4 1 0 79 19 - 0 ksofti ? 00:00:00 ksoftirqd_CPU0 040 S 0 5 1 0 69 0 - 0 kswapd ? 00:00:00 kswapd 040 S 0 6 1 0 69 0 - 0 bdflus ? 00:00:00 bdflush 040 S 0 7 1 0 69 0 - 0 kupdat ? 00:00:00 kupdated 040 S 0 9 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_2 040 S 0 10 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_3 040 S 0 11 1 0 69 0 - 0 usb_hu ? 00:00:00 khubd 040 S 0 16 1 0 59 -20 - 0 md_thr ? 00:00:00 mdrecoveryd 040 S 0 17 1 0 69 0 - 0 kjourn ? 00:00:04 kjournald 040 S 0 147 1 0 69 0 - 0 kjourn ? 00:00:00 kjournald 040 S 0 446 1 0 69 0 - 331 nanosl ? 00:00:00 dhcpcd 140 S 0 551 1 0 69 0 - 355 do_sel ? 00:00:01 syslogd 140 S 0 556 1 0 69 0 - 330 do_sys ? 00:00:00 klogd 140 S 1 576 1 0 69 0 - 404 do_pol ? 00:00:00 portmap 140 S 29 604 1 0 69 0 - 378 do_sel ? 00:00:00 rpc.statd 140 S 0 719 1 0 68 0 - 328 do_sel ? 00:00:00 apmd 040 S 0 808 1 0 68 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 810 1 0 68 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 826 1 0 69 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 847 1 0 69 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 868 1 0 68 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 924 1 0 69 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 926 1 0 69 0 - 356 pipe_w ? 00:00:00 automount 040 S 0 928 1 0 69 0 - 356 pipe_w ? 00:00:00 automount 140 S 0 982 1 0 69 0 - 491 do_sel ? 00:00:00 xinetd 140 S 0 1003 1 0 69 0 - 355 do_sel ? 00:00:00 lpd 140 S 0 1022 1 0 69 0 - 337 wait_f ? 00:00:22 rwhod 040 S 0 1026 1022 0 69 0 - 338 nanosl ? 00:00:00 rwhod 040 S 0 1474 1 0 68 0 - 342 nanosl ? 00:00:00 crond 040 S 1 1599 1 0 69 0 - 339 nanosl ? 
00:00:00 atd 140 S 0 1620 1 0 69 0 - 576 do_sel ? 00:00:00 gated 100 S 0 1680 1 0 69 0 - 547 do_sel ? 00:00:00 master 100 S 1026 1688 1680 0 69 0 - 570 do_sel ? 00:00:00 qmgr 140 S 0 1691 1 0 69 0 - 623 do_sel ? 00:00:07 sshd 100 S 0 1716 1 0 69 0 - 325 read_c tty1 00:00:00 mingetty 100 S 0 1717 1 0 69 0 - 325 read_c tty2 00:00:00 mingetty 100 S 0 1718 1 0 69 0 - 325 read_c tty3 00:00:00 mingetty 100 S 0 1719 1 0 69 0 - 325 read_c tty4 00:00:00 mingetty 100 S 0 1720 1 0 69 0 - 325 read_c tty5 00:00:00 mingetty 100 S 0 1721 1 0 69 0 - 325 read_c tty6 00:00:00 mingetty 140 S 0 1725 1 0 69 0 - 794 do_sel ? 00:00:00 xdm 140 S 0 6488 1691 0 69 0 - 1308 unix_s ? 00:00:00 sshd 140 S 1915 6490 6488 0 69 0 - 1309 do_sel ? 00:00:00 sshd 000 S 1915 6491 6490 0 69 0 - 693 rt_sig pts/0 00:00:00 tcsh 040 S 0 6495 1 0 69 0 - 0 rpciod ? 00:00:00 rpciod 040 S 0 6496 1 0 69 0 - 0 svc_re ? 00:00:00 lockd 140 S 0 6739 1691 0 69 0 - 1286 unix_s ? 00:00:00 sshd 140 S 1814 6744 6739 0 69 0 - 1356 do_sel ? 00:00:28 sshd 000 S 1814 6746 6744 0 69 0 - 522 wait4 pts/2 00:00:00 bash 000 S 1814 6779 6746 0 69 0 - 543 wait4 pts/2 00:00:00 bash 000 S 1814 6966 6779 0 69 0 - 76638 do_sel pts/2 00:00:03 MATLAB 000 S 1814 7016 6966 0 69 0 - 406 pipe_w pts/3 00:00:00 matlab_helper 040 S 1814 7017 6966 0 68 0 - 76638 do_pol pts/2 00:00:00 MATLAB 040 S 1814 7018 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB 040 S 1814 7019 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB 040 S 1814 7020 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7021 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7022 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7023 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7024 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7025 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB 040 S 1814 7026 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7027 7017 0 69 0 - 76638 do_pol pts/2 00:00:00 MATLAB 040 S 1814 7028 
7017 0 69 0 - 76638 do_sel pts/2 00:00:00 MATLAB 040 S 1814 7029 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7031 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7035 7017 0 69 0 - 76638 rt_sig pts/2 00:00:04 MATLAB 040 S 1814 7036 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7037 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7038 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 040 S 1814 7042 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB 040 S 1814 7044 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB 100 S 1026 7212 1680 0 69 0 - 557 do_sel ? 00:00:00 pickup 000 S 1915 7875 6491 0 69 0 - 581 wait4 pts/0 00:00:00 bash 100 S 0 8125 1725 0 69 0 - 20162 do_sel ? 00:00:00 X 040 S 0 8126 1725 0 69 0 - 854 do_sel ? 00:00:00 xdm 000 S 1814 8205 6779 0 69 0 - 5970 wait4 pts/2 00:00:00 dm 140 S 0 8302 1691 0 69 0 - 1306 unix_s ? 00:00:00 sshd 140 S 1915 8304 8302 0 69 0 - 1287 do_sel ? 00:00:00 sshd 000 S 1915 8305 8304 0 71 0 - 693 rt_sig pts/1 00:00:00 tcsh 100 S 59997 8319 982 0 69 0 - 369 wait_f ? 00:00:00 spingd 000 S 1915 8359 7875 0 69 0 - 958 do_sel pts/0 00:00:00 boinc 000 S 1915 8360 8359 0 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6 040 S 1915 8361 8360 0 79 19 - 4492 do_pol pts/0 00:00:00 sulphur_4.23_i6 040 S 1915 8362 8361 2 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6 040 S 1915 8363 8361 0 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6 040 R 1814 8364 8205 99 77 0 - 10710 - pts/2 00:00:03 dm 000 R 1915 8365 8362 0 79 19 - 3981 - pts/0 00:00:00 sulphur_um_4.23 000 R 1915 8366 8305 0 73 0 - 834 - pts/1 00:00:00 ps F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 100 S 0 1 0 0 68 0 - 343 do_sel ? 00:00:04 init 040 S 0 2 1 0 69 0 - 0 contex ? 00:00:00 keventd 040 S 0 3 1 0 69 0 - 0 apm_ma ? 00:00:00 kapmd 040 S 0 4 1 0 79 19 - 0 ksofti ? 00:00:00 ksoftirqd_CPU0 040 S 0 5 1 0 69 0 - 0 kswapd ? 00:00:00 kswapd 040 S 0 6 1 0 69 0 - 0 bdflus ? 
00:00:00 bdflush 040 S 0 7 1 0 69 0 - 0 kupdat ? 00:00:00 kupdated 040 S 0 9 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_2 040 S 0 10 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_3 040 S 0 11 1 0 69 0 - 0 usb_hu ? 00:00:00 khubd 040 S 0 16 1 0 59 -20 - 0 md_thr ? 00:00:00 mdrecoveryd 040 S 0 17 1 0 69 0 - 0 kjourn ? 00:00:06 kjournald 040 S 0 144 1 0 69 0 - 0 kjourn ? 00:00:01 kjournald 040 S 0 451 1 0 69 0 - 342 nanosl ? 00:00:00 dhcpcd 140 S 0 556 1 0 69 0 - 364 do_sel ? 00:00:01 syslogd 140 S 0 561 1 0 69 0 - 341 do_sys ? 00:00:00 klogd 140 S 1 581 1 0 69 0 - 415 do_pol ? 00:00:00 portmap 140 S 29 609 1 0 69 0 - 390 do_sel ? 00:00:00 rpc.statd 140 S 0 725 1 0 68 0 - 340 do_sel ? 00:00:00 apmd 040 S 0 811 1 0 68 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 813 1 0 68 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 831 1 0 69 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 852 1 0 69 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 876 1 0 68 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 897 1 0 69 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 931 1 0 68 0 - 368 pipe_w ? 00:00:00 automount 040 S 0 933 1 0 69 0 - 368 pipe_w ? 00:00:00 automount 140 S 0 986 1 0 69 0 - 535 do_sel ? 00:00:00 xinetd 140 S 0 1008 1 0 68 0 - 366 do_sel ? 00:00:00 lpd 140 S 0 1026 1 0 69 0 - 347 wait_f ? 00:00:27 rwhod 040 S 0 1043 1026 0 69 0 - 348 nanosl ? 00:00:00 rwhod 040 S 0 1501 1 0 68 0 - 384 nanosl ? 00:00:00 crond 040 S 1 1604 1 0 69 0 - 351 nanosl ? 00:00:00 atd 140 S 0 1625 1 0 69 0 - 586 do_sel ? 00:00:00 gated 100 S 0 1685 1 0 69 0 - 559 do_sel ? 00:00:00 master 100 S 1026 1695 1685 0 69 0 - 582 do_sel ? 00:00:00 qmgr 140 S 0 1696 1 0 69 0 - 635 do_sel ? 
00:00:09 sshd 100 S 0 1721 1 0 69 0 - 336 read_c tty1 00:00:00 mingetty 100 S 0 1722 1 0 69 0 - 336 read_c tty2 00:00:00 mingetty 100 S 0 1723 1 0 69 0 - 336 read_c tty3 00:00:00 mingetty 100 S 0 1724 1 0 69 0 - 336 read_c tty4 00:00:00 mingetty 100 S 0 1725 1 0 69 0 - 336 read_c tty5 00:00:00 mingetty 100 S 0 1726 1 0 69 0 - 336 read_c tty6 00:00:00 mingetty 140 S 0 1730 1 0 69 0 - 805 do_sel ? 00:00:00 xdm 040 S 0 10552 1 0 69 0 - 0 rpciod ? 00:00:06 rpciod 040 S 0 10553 1 0 69 0 - 0 svc_re ? 00:00:00 lockd 140 S 0 10743 1696 0 69 0 - 1320 unix_s ? 00:00:00 sshd 140 S 1915 10745 10743 0 69 0 - 1324 do_sel ? 00:00:01 sshd 000 S 1915 10746 10745 0 69 0 - 852 rt_sig pts/0 00:00:00 tcsh 000 S 1915 10901 10746 0 69 0 - 566 wait4 pts/0 00:00:00 bash 000 S 1915 10906 10901 0 69 0 - 841 do_sel pts/0 00:00:00 mc 000 S 1915 10908 10906 0 69 0 - 683 rt_sig pts/2 00:00:00 tcsh 000 S 1915 10948 10908 0 74 0 - 552 wait4 pts/2 00:00:00 bash 100 S 1915 10950 10948 0 73 5 - 1013 do_sel pts/2 00:00:00 boinc 000 S 1915 10952 10950 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6 040 S 1915 10953 10952 0 79 19 - 5985 do_pol pts/2 00:00:00 sulphur_4.22_i6 040 S 1915 10954 10953 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6 040 S 1915 10955 10953 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6 000 R 1915 10956 10954 98 79 19 - 31574 - pts/2 15:42:07 sulphur_um_4.22 040 S 1915 10957 10956 0 79 19 - 31574 do_pol pts/2 00:00:00 sulphur_um_4.22 040 S 1915 10958 10957 0 79 19 - 31574 nanosl pts/2 00:00:00 sulphur_um_4.22 100 S 0 12312 1730 0 69 0 - 20160 do_sel ? 00:00:00 X 040 S 0 12313 1730 0 68 0 - 864 do_sel ? 00:00:00 xdm 100 S 1026 13355 1685 0 69 0 - 568 do_sel ? 00:00:00 pickup 100 S 59997 13479 986 0 69 0 - 379 wait_f ? 00:00:00 spingd 000 R 1915 13483 10948 0 77 0 - 846 - pts/2 00:00:00 ps As you can see, on SHAPE, the application does not consume any CPU time and seems to be deadlocked. 
After 180 seconds, BOINC kills it and gives a failure-to-start-model message: Model timeout at 180.00 seconds On SHADING, the application is running fine, and model startup takes only a few seconds. At the moment it's been running for about 15 hours. I let it run last night just for testing, since the computer is now idle due to the holidays, so I don't bother anyone by running a job on it. This seems to be a real mystery. Cheers, Stefan. |
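To make the "no CPU consumption" observation easier to check across several machines, here is a small Python sketch that sums the TIME column of `ps -Alw` output for commands with a given prefix. It assumes the 14-column layout shown in the snapshots above and a TIME field of the form HH:MM:SS; the function names are hypothetical helpers, not part of ps or BOINC.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value of the form HH:MM:SS into seconds."""
    hours, minutes, seconds = (int(part) for part in time_field.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_time_by_command(ps_lines, prefix):
    """Sum CPU seconds per command name for commands starting with prefix.

    Assumes the column layout used in this thread:
    F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
    """
    totals = {}
    for line in ps_lines:
        fields = line.split(None, 13)
        if len(fields) < 14 or fields[0] == "F":
            continue  # skip the header row and malformed lines
        time_field, command = fields[12], fields[13].strip()
        if command.startswith(prefix):
            totals[command] = totals.get(command, 0) + cpu_seconds(time_field)
    return totals
```

A total of zero seconds for every `sulphur` process after a few minutes matches the stalled case described for SHAPE, while a large value for `sulphur_um` matches the healthy run on SHADING.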
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Oh, Geophi, one more thing. In my previous listing, SHADING (the working computer) was running SULPHUR 4.22 instead of SULPHUR 4.23 (this was because I made the first experiments before the app had been updated). However, I have just checked that everything works exactly the same when running model 4.23, so the bug is still in 4.23 as it was in 4.22. Stefan. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I notice that MATLAB is running on the computer that's not working. Coincidence? |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Nice observation. Yes, indeed, one of my colleagues is running a MATLAB job on one of the computers. Unfortunately, I checked the other computers: there are computers (including my own) which are not running MATLAB at the moment, and SULPHUR still doesn't work on them. Do you think the cause could be some process running in the background? Cheers, Stefan. P.S. Unfortunately, in the case of SHAPE I cannot stop the other process from running (since it is not my own :) ). |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The little Linux I know doesn't go down to this level, but from general computer use, it's possibly a background process. But cpdn apps, especially sulphur, are complex beasts, and don't like other programs getting in the way while they're running. I'm not looking forward to the start of the coupled ocean model. There's going to be a lot of tears and wailing, I fear. :( |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Do you think the cause could be some process running in the background? For your general preferences for these computers, do you have "Do work while computer in use" set to yes? |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Yes, I've just checked the preferences, and it is set to "yes" (I also checked the XML file: the <run_if_user_active/> tag is there). But I don't think that could have been the problem anyway, since, as far as I understand, that option tells BOINC not to start the sulphur application at all if the user is active (or to unload it from memory as he/she becomes active). In my case, BOINC is TRYING TO START sulphur (but fails to do so, resulting in an unrecoverable error). By the way, if you let BOINC run, after a number of useless attempts to start sulphur, BOINC gives up and reports failure on the data set (error -161, since no files were created by SULPHUR): Starting model ID sulphur_ihfj_000862399 Phase 1 Waiting for model startup, this may take a minute... Model timeout at 180.00 seconds Preparing for restart... Rewinding a model-day... Starting model ID sulphur_ihfj_000862399 Phase 1 Waiting for model startup, this may take a minute... Model timeout at 180.00 seconds ...and so on, and then: 2006-01-04 19:53:58 [climateprediction.net] Unrecoverable error for result sulphur_ihfj_000862399_0 (<file_xfer_error> <file_name>sulphur_ihfj_000862399_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_ihfj_000862399_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_ihfj_000862399_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_ihfj_000862399_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> <file_xfer_error> <file_name>sulphur_ihfj_000862399_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> Regards, Stefan. |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
I have some more information relating to the bug, obtained by running an strace. It seems that sulphur_um_4.22 dies with a SEGMENTATION FAULT on SHAPE! I have also found the point where it dies in the strace (see log below) [...] getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82 open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005208", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) read(13, " 1 "..., 32768) = 122 close(13) = 0 getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82 open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005222", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) read(13, " 1 "..., 32768) = 122 close(13) = 0 getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82 open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005223", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) read(13, " 1 "..., 32768) = 122 close(13) = 0 getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82 
open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01010206", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device) read(13, " 4 "..., 32768) = 122 close(13) = 0 write(8, " UMSETUP; CRUN, read CONTCNTL f"..., 32564) = 32564 write(8, " 1 1 72 1"..., 32553) = 32553 --- SIGSEGV (Segmentation fault) --- On SHADING, however, there is no segmentation fault at this point! The next lines in the STRACE would then be: getcwd("/h/10/mstefan/boinc.glenora/BOINC/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 88 open("/h/10/mstefan/boinc.glenora/BOINC/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/STASHmaster/STASHmaster_A", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbee85d44) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(13, {st_mode=S_IFREG|0644, st_size=322516, ...}) = 0 ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbee85d44) = -1 ENOTTY (Inappropriate ioctl for device) read(13, "H1| SUBMODEL_NUMBER=1\nH2| SUBMOD"..., 32768) = 32768 read(13, "110 |CLOUD SOOT MASS MIX RAT AFT"..., 32750) = 32750 _llseek(13, 0, [65518], SEEK_CUR) = 0 _llseek(13, -65475, [43], SEEK_CUR) = 0 read(13, "OS\nH3| UM_VERSION=4.5\n#\n#|Model "..., 32767) = 32767 read(13, "\n2| 2 | 0 | 1 | 1 | "..., 32708) = 32708 read(13, " 1 | 2 | 205 |OUTGOING LW R"..., 32764) = 32764 _llseek(13, 0, [98282], SEEK_CUR) = 0 _llseek(13, -65475, [32807], SEEK_CUR) = 0 read(13, "P |\n2| 2 | 0 | 1 | 1"..., 32767) = 32767 read(13, "\n2| 0 | 0 | 2 | 1 | "..., 32708) = 32708 _llseek(13, 0, [98282], SEEK_CUR) = 0 [...and so on...] 
Notes: * The trace was done using sulphur_4.22 because I have just lost my 4.23 data set (because of multiple failure testing on different machines) and the server is down, so I can't get a new data set for it. However, a similar trace exists for it (and as soon as the project is up again I shall run a trace on that one). * The command for the trace was: strace -o trace -ff boinc * The log fragments above correspond to the trace for the process sulphur_um_4.22. * The function call failures (Inappropriate ioctl for device) appear in the logs for both SHAPE and SHADING, and thus seem to be normal. Hence, our problem is now restated as follows: On some Linux machines, sulphur_um_4.22 crashes with a segmentation fault. The point where it does so is just before opening the file datain/ancil/ctldata/STASHmaster/STASHmaster_A in the data set. I hope this helps the authors find the bug. Cheers, Stefan. |
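Locating the crash point by scrolling through a long trace is tedious, so here is a minimal Python sketch that returns the last system call logged before a signal marker such as `--- SIGSEGV (Segmentation fault) ---`. The function name is a hypothetical helper, and the sketch assumes one strace record per line with signal lines starting with `---`, as in the fragments quoted above.

```python
def last_call_before_signal(lines, signal="SIGSEGV"):
    """Return the last non-empty trace line seen before the signal marker.

    Returns None when the signal never appears in the given lines.
    """
    previous = None
    for line in lines:
        text = line.strip()
        if text.startswith("---") and signal in text:
            return previous
        if text:
            previous = text
    return None


if __name__ == "__main__":
    import sys

    # Usage (file name is an example): python last_call.py trace.<pid>
    with open(sys.argv[1]) as handle:
        print(last_call_before_signal(handle))
```

With `strace -o trace -ff`, each traced process gets its own output file named `trace.<pid>`, so the script would be run on the file belonging to the sulphur_um process.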
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Latest news: Found a difference between SHADING and all the other computers! The following lines from the strace log file show which libraries sulphur is binding to: open("/lib/i686/mmx/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/i686/mmx", 0xbfffefc0) = -1 ENOENT (No such file or directory) open("/lib/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/i686", 0xbfffefc0) = -1 ENOENT (No such file or directory) open("/lib/mmx/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) stat64("/lib/mmx", 0xbfffefc0) = -1 ENOENT (No such file or directory) open("/lib/libm.so.6", O_RDONLY) = 6 As you can see, it is looking for the file "libm.so.6". There are also similar searches for "libc.so.6" and "libpthread.so.0". Now, I looked for the versions of these files and found the following: * All stations except SHADING are using: libm version 2.3.2 libc version 2.3.2 pthread version 0.10 * SHADING is using: libm version 2.2.5 libc version 2.2.5 pthread version 0.9 So, the hypothesis would be: Maybe SULPHUR does not work well (gives a segmentation fault) with the library version combination LIBM 2.3.2, LIBC 2.3.2, LIBPTHREAD 0.10. On the other hand, it works well with the (older) combination LIBM 2.2.5, LIBC 2.2.5, LIBPTHREAD 0.9. It's just a hypothesis, but up to this point it's the only difference I could find. Cheers again, Stefan. |
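For anyone who wants to check their own machine against the two combinations listed above, here is a hedged Python sketch. It records the version sets reported in this thread, compares dotted versions numerically (so that 0.10 counts as newer than 0.9), and uses the standard-library call `platform.libc_ver()` to report the C library the Python interpreter is linked against. The constant and function names are hypothetical, and the link between the newer set and the crash is only the hypothesis stated in the post, not an established cause.

```python
import platform

# Library combinations reported in this thread. Whether the newer set
# really causes the segmentation fault is a hypothesis, not a known fact.
REPORTED_WORKING = {"libc": "2.2.5", "libm": "2.2.5", "pthread": "0.9"}
REPORTED_FAILING = {"libc": "2.3.2", "libm": "2.3.2", "pthread": "0.10"}


def version_key(version):
    """Turn '2.3.2' into (2, 3, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def classify(versions):
    """Say which reported combination, if any, a version set matches."""
    if versions == REPORTED_WORKING:
        return "reported working"
    if versions == REPORTED_FAILING:
        return "reported failing"
    return "not covered by the thread"


def local_libc():
    """Return (name, version) of the C library Python is linked against.

    Both strings may be empty when the library cannot be determined.
    """
    return platform.libc_ver()


if __name__ == "__main__":
    print("local C library:", local_libc())
```

Comparing the version strings as plain text would order "0.10" before "0.9", which is why the sketch converts them to integer tuples first.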
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Hypothesis confirmed: found another computer that works. MARS also has versions LIBC 2.2.5, LIBM 2.2.5, PTHREAD 0.9! Hence, I believe this is it: *************************** SULPHUR_UM does not work (gives a segmentation fault) with the new library versions LIBM 2.3.2, LIBC 2.3.2 and PTHREAD 0.10. *************************** I promise to stop flooding you with messages now that it seems we've found the answer, cross my heart! Cheers, Stefan. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Great piece of detective work, Stefan. Now to see what Tolu thinks. |
Send message Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Perhaps the new sulphur apps were compiled on a too "up-to-date" machine, containing all the latest libraries. AFAIK, Carl had this problem with the Beta Sulphur (or was it Spinup, I don't remember well): compiling on Sarge created error messages, and reverting to a previous version of Debian solved some problems. But I don't know if that is the case now. Arnaud |
Send message Joined: 28 Sep 04 Posts: 36 Credit: 268,150 RAC: 0 |
Perhaps the new sulphur apps were compiled on a too "up-to-date" machine, containing all the latest libraries. Yes, but in this case it seems to be the other way around: sulphur works well with the old libraries, but does not work well with the new ones. Would this mean that the new versions of the libraries are not entirely backward compatible? Or that sulphur somewhere assumes something about the behaviour of some library functions that is not in the specification and thus is not maintained in the new version? Well, only the authors of the app will be able to tell us, I guess... |