new app. 4.23 resolves signal 11 bug

old_user3

Joined: 5 Aug 04
Posts: 173
Credit: 1,843,046
RAC: 0
Message 18998 - Posted: 4 Jan 2006, 17:44:30 UTC

The new app version 4.23 resolves the reported signal 11 error. It also brings a speed improvement on Intel processors.

old_user21637
Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19002 - Posted: 4 Jan 2006, 18:28:55 UTC

Great news and congratulations to the software development team!

However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date), and the "model won't start" bug has not yet been addressed.

This bug seems to affect mostly Red Hat distributions. It is very deterministic. On some machines, the sulphur model simply won't start. It hangs in a kind of deadlock state, without any CPU consumption.

I have tested this bug on 7 machines with different kernel versions and performance levels. Only 1 machine does not suffer from this strange bug, but I have not yet been able to determine what is different on that machine compared with the others (for example, there was another machine with the same kernel version which did not work). This shows, however, that this IS a very serious bug, rendering the application completely unusable on many Linux machines.

Cheers,
Stefan.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2185
Credit: 64,822,615
RAC: 5,275
Message 19004 - Posted: 4 Jan 2006, 18:53:54 UTC - in response to Message 19002.  

However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date), and the "model won't start" bug has not yet been addressed.

Could you give us a link to the other threads that may have talked about this, or at least more documentation of the error? I've had no problem starting sulphur on 3 different Fedora Core installations (1 FC3 and 2 FC4).

old_user21637
Message 19005 - Posted: 4 Jan 2006, 19:50:35 UTC - in response to Message 19004.  

However, I would like to raise an alarm here: it has now been 4 versions (4.19 to date), and the "model won't start" bug has not yet been addressed.

Could you give us a link to the other threads that may have talked about this, or at least more documentation of the error? I've had no problem starting sulphur on 3 different Fedora Core installations (1 FC3 and 2 FC4).

Well, I think this thread is related to it:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=3737

Yes, I would gladly help out and give you as many details as possible about the machines. I am only a beginner on Linux, so if you could tell me any particular tests (for example, certain commands) I should run on the machines, I would gladly run them and give you their outputs. Up to this point, all I know is how to query the kernel version with the "uname -a" command. (I think a real Linux guru will have a good laugh reading this.)

The machines I am testing on are part of my university's network. (Apart from my own machine, I will not run the project on those machines since they don't belong to me, but I can use them to test and try to figure out the problem.)

Here are some stats I have:

Machine name: glenora
Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.00GHz
Memory: 500.84 MB physical, 1.00 GB virtual
uname output: Linux glenora.cs 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown


Machine name: shading
Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.60GHz
Memory: 1.97 GB physical, 1.00 GB virtual
uname output: Linux shading.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown


Machine name: shape
Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.60GHz
Memory: 1.97 GB physical, 1.00 GB virtual
uname output: Linux shape.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown


Machine name: mlp1
Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.53GHz
Memory: 501.61 MB physical, 1.00 GB virtual
uname output: Linux mlp1.ai 2.4.20-30.7.legacy #1 Fri Feb 20 10:46:44 PST 2004 i686 unknown


Machine name: bauda
Machine type: Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 1.90GHz
Memory: 500.84 MB physical, 1.00 GB virtual
uname output: Linux bauda.vis 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown


Notes:

* Out of all these machines, "shading" is the only one that works.
* Up to this point, I've been running HADSM models and it all worked fine. The problems appeared only when I switched to sulphur.

Please tell me what more information you require, and I will do my best to get it for you.

Warm regards,
Stefan.

geophi
Volunteer moderator
Message 19012 - Posted: 4 Jan 2006, 23:45:53 UTC
Last modified: 4 Jan 2006, 23:48:13 UTC

I'm no Linux guru either. Try this on shading and shape, since they look so similar and one works and the other doesn't:

ulimit -a

and paste the output from those two PCs in your reply. This is what I got on one of my PCs that has had no problems with starting sulphur (P4 3.4 GHz, 512 MB RAM, Mandrake 10.1):

root@localhost root# ulimit -a
core file size (blocks, -c) 1000000
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4095
virtual memory (kbytes, -v) unlimited

old_user21637
Message 19013 - Posted: 5 Jan 2006, 0:15:50 UTC - in response to Message 19012.  

I'm no Linux guru either. Try this on shading and shape, since they look so similar and one works and the other doesn't:

Ok, here they are:

shape:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited

shading:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited

They seem identical :(. Yet I re-checked just now (for the 3rd time), to make sure that I am not misleading you:
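
One way to double-check that two "ulimit -a" listings really match is to parse them and compare them limit by limit. What follows is only a minimal Python sketch: the names parse_ulimit and diff_limits are invented for this illustration, and it assumes every line has the shape "description (unit, -flag) value" seen in the listings above.

```python
import re

# Matches lines such as "core file size (blocks, -c) 0" or "open files (-n) 1024".
# Assumption: one limit per line, with the option flag inside the parentheses.
LINE_RE = re.compile(
    r"^(?P<name>.+?)\s*\((?:[^()]*,\s*)?-(?P<flag>\w)\)\s*(?P<value>\S+)$"
)


def parse_ulimit(text):
    """Map each option flag (e.g. 'c') to a (description, value) pair."""
    limits = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            limits[match.group("flag")] = (match.group("name"), match.group("value"))
    return limits


def diff_limits(first, second):
    """Return {flag: (value_in_first, value_in_second)} for limits that differ."""
    differences = {}
    for flag in sorted(set(first) | set(second)):
        value_a = first.get(flag, (None, None))[1]
        value_b = second.get(flag, (None, None))[1]
        if value_a != value_b:
            differences[flag] = (value_a, value_b)
    return differences
```

Calling diff_limits(parse_ulimit(a), parse_ulimit(b)) on two pasted listings returns an empty dictionary when every limit matches, which is what the SHAPE and SHADING listings above suggest.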

Indeed, the model deadlocks on SHAPE and works fine on SHADING.

I keep wondering, then: what is different between these two computers? The SAME executable works fine on one and gets deadlocked on the other... Strange...

Well, Geophi, please tell me what other tests we might carry out next.

Cheers,
Stefan.

geophi
Volunteer moderator
Message 19014 - Posted: 5 Jan 2006, 0:22:08 UTC - in response to Message 19013.  

Indeed, the model deadlocks on SHAPE and works fine on SHADING.

I keep wondering, then: what is different between these two computers? The SAME executable works fine on one and gets deadlocked on the other... Strange...

Well, Geophi, please tell me what other tests we might carry out next.

My lack of guruness is showing. I'm afraid at this point we will have to wait for a real guru.

old_user21637
Message 19015 - Posted: 5 Jan 2006, 0:35:09 UTC - in response to Message 19014.  

Hi, Geophi. I don't know if it is of any help, but just in case...

Here are also two process snapshots (ps -Alw), while running sulphur.

The first one is taken on SHAPE, the second on SHADING.

F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
100 S 0 1 0 0 68 0 - 331 do_sel ? 00:00:04 init
040 S 0 2 1 0 69 0 - 0 contex ? 00:00:00 keventd
040 S 0 3 1 0 69 0 - 0 apm_ma ? 00:00:00 kapmd
040 S 0 4 1 0 79 19 - 0 ksofti ? 00:00:00 ksoftirqd_CPU0
040 S 0 5 1 0 69 0 - 0 kswapd ? 00:00:00 kswapd
040 S 0 6 1 0 69 0 - 0 bdflus ? 00:00:00 bdflush
040 S 0 7 1 0 69 0 - 0 kupdat ? 00:00:00 kupdated
040 S 0 9 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_2
040 S 0 10 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_3
040 S 0 11 1 0 69 0 - 0 usb_hu ? 00:00:00 khubd
040 S 0 16 1 0 59 -20 - 0 md_thr ? 00:00:00 mdrecoveryd
040 S 0 17 1 0 69 0 - 0 kjourn ? 00:00:04 kjournald
040 S 0 147 1 0 69 0 - 0 kjourn ? 00:00:00 kjournald
040 S 0 446 1 0 69 0 - 331 nanosl ? 00:00:00 dhcpcd
140 S 0 551 1 0 69 0 - 355 do_sel ? 00:00:01 syslogd
140 S 0 556 1 0 69 0 - 330 do_sys ? 00:00:00 klogd
140 S 1 576 1 0 69 0 - 404 do_pol ? 00:00:00 portmap
140 S 29 604 1 0 69 0 - 378 do_sel ? 00:00:00 rpc.statd
140 S 0 719 1 0 68 0 - 328 do_sel ? 00:00:00 apmd
040 S 0 808 1 0 68 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 810 1 0 68 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 826 1 0 69 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 847 1 0 69 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 868 1 0 68 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 924 1 0 69 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 926 1 0 69 0 - 356 pipe_w ? 00:00:00 automount
040 S 0 928 1 0 69 0 - 356 pipe_w ? 00:00:00 automount
140 S 0 982 1 0 69 0 - 491 do_sel ? 00:00:00 xinetd
140 S 0 1003 1 0 69 0 - 355 do_sel ? 00:00:00 lpd
140 S 0 1022 1 0 69 0 - 337 wait_f ? 00:00:22 rwhod
040 S 0 1026 1022 0 69 0 - 338 nanosl ? 00:00:00 rwhod
040 S 0 1474 1 0 68 0 - 342 nanosl ? 00:00:00 crond
040 S 1 1599 1 0 69 0 - 339 nanosl ? 00:00:00 atd
140 S 0 1620 1 0 69 0 - 576 do_sel ? 00:00:00 gated
100 S 0 1680 1 0 69 0 - 547 do_sel ? 00:00:00 master
100 S 1026 1688 1680 0 69 0 - 570 do_sel ? 00:00:00 qmgr
140 S 0 1691 1 0 69 0 - 623 do_sel ? 00:00:07 sshd
100 S 0 1716 1 0 69 0 - 325 read_c tty1 00:00:00 mingetty
100 S 0 1717 1 0 69 0 - 325 read_c tty2 00:00:00 mingetty
100 S 0 1718 1 0 69 0 - 325 read_c tty3 00:00:00 mingetty
100 S 0 1719 1 0 69 0 - 325 read_c tty4 00:00:00 mingetty
100 S 0 1720 1 0 69 0 - 325 read_c tty5 00:00:00 mingetty
100 S 0 1721 1 0 69 0 - 325 read_c tty6 00:00:00 mingetty
140 S 0 1725 1 0 69 0 - 794 do_sel ? 00:00:00 xdm
140 S 0 6488 1691 0 69 0 - 1308 unix_s ? 00:00:00 sshd
140 S 1915 6490 6488 0 69 0 - 1309 do_sel ? 00:00:00 sshd
000 S 1915 6491 6490 0 69 0 - 693 rt_sig pts/0 00:00:00 tcsh
040 S 0 6495 1 0 69 0 - 0 rpciod ? 00:00:00 rpciod
040 S 0 6496 1 0 69 0 - 0 svc_re ? 00:00:00 lockd
140 S 0 6739 1691 0 69 0 - 1286 unix_s ? 00:00:00 sshd
140 S 1814 6744 6739 0 69 0 - 1356 do_sel ? 00:00:28 sshd
000 S 1814 6746 6744 0 69 0 - 522 wait4 pts/2 00:00:00 bash
000 S 1814 6779 6746 0 69 0 - 543 wait4 pts/2 00:00:00 bash
000 S 1814 6966 6779 0 69 0 - 76638 do_sel pts/2 00:00:03 MATLAB
000 S 1814 7016 6966 0 69 0 - 406 pipe_w pts/3 00:00:00 matlab_helper
040 S 1814 7017 6966 0 68 0 - 76638 do_pol pts/2 00:00:00 MATLAB
040 S 1814 7018 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB
040 S 1814 7019 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB
040 S 1814 7020 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7021 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7022 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7023 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7024 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7025 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB
040 S 1814 7026 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7027 7017 0 69 0 - 76638 do_pol pts/2 00:00:00 MATLAB
040 S 1814 7028 7017 0 69 0 - 76638 do_sel pts/2 00:00:00 MATLAB
040 S 1814 7029 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7031 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7035 7017 0 69 0 - 76638 rt_sig pts/2 00:00:04 MATLAB
040 S 1814 7036 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7037 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7038 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
040 S 1814 7042 7017 0 69 0 - 76638 nanosl pts/2 00:00:00 MATLAB
040 S 1814 7044 7017 0 69 0 - 76638 rt_sig pts/2 00:00:00 MATLAB
100 S 1026 7212 1680 0 69 0 - 557 do_sel ? 00:00:00 pickup
000 S 1915 7875 6491 0 69 0 - 581 wait4 pts/0 00:00:00 bash
100 S 0 8125 1725 0 69 0 - 20162 do_sel ? 00:00:00 X
040 S 0 8126 1725 0 69 0 - 854 do_sel ? 00:00:00 xdm
000 S 1814 8205 6779 0 69 0 - 5970 wait4 pts/2 00:00:00 dm
140 S 0 8302 1691 0 69 0 - 1306 unix_s ? 00:00:00 sshd
140 S 1915 8304 8302 0 69 0 - 1287 do_sel ? 00:00:00 sshd
000 S 1915 8305 8304 0 71 0 - 693 rt_sig pts/1 00:00:00 tcsh
100 S 59997 8319 982 0 69 0 - 369 wait_f ? 00:00:00 spingd
000 S 1915 8359 7875 0 69 0 - 958 do_sel pts/0 00:00:00 boinc
000 S 1915 8360 8359 0 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6
040 S 1915 8361 8360 0 79 19 - 4492 do_pol pts/0 00:00:00 sulphur_4.23_i6
040 S 1915 8362 8361 2 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6
040 S 1915 8363 8361 0 79 19 - 4492 nanosl pts/0 00:00:00 sulphur_4.23_i6
040 R 1814 8364 8205 99 77 0 - 10710 - pts/2 00:00:03 dm
000 R 1915 8365 8362 0 79 19 - 3981 - pts/0 00:00:00 sulphur_um_4.23
000 R 1915 8366 8305 0 73 0 - 834 - pts/1 00:00:00 ps


F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
100 S 0 1 0 0 68 0 - 343 do_sel ? 00:00:04 init
040 S 0 2 1 0 69 0 - 0 contex ? 00:00:00 keventd
040 S 0 3 1 0 69 0 - 0 apm_ma ? 00:00:00 kapmd
040 S 0 4 1 0 79 19 - 0 ksofti ? 00:00:00 ksoftirqd_CPU0
040 S 0 5 1 0 69 0 - 0 kswapd ? 00:00:00 kswapd
040 S 0 6 1 0 69 0 - 0 bdflus ? 00:00:00 bdflush
040 S 0 7 1 0 69 0 - 0 kupdat ? 00:00:00 kupdated
040 S 0 9 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_2
040 S 0 10 1 0 69 0 - 0 down_i ? 00:00:00 scsi_eh_3
040 S 0 11 1 0 69 0 - 0 usb_hu ? 00:00:00 khubd
040 S 0 16 1 0 59 -20 - 0 md_thr ? 00:00:00 mdrecoveryd
040 S 0 17 1 0 69 0 - 0 kjourn ? 00:00:06 kjournald
040 S 0 144 1 0 69 0 - 0 kjourn ? 00:00:01 kjournald
040 S 0 451 1 0 69 0 - 342 nanosl ? 00:00:00 dhcpcd
140 S 0 556 1 0 69 0 - 364 do_sel ? 00:00:01 syslogd
140 S 0 561 1 0 69 0 - 341 do_sys ? 00:00:00 klogd
140 S 1 581 1 0 69 0 - 415 do_pol ? 00:00:00 portmap
140 S 29 609 1 0 69 0 - 390 do_sel ? 00:00:00 rpc.statd
140 S 0 725 1 0 68 0 - 340 do_sel ? 00:00:00 apmd
040 S 0 811 1 0 68 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 813 1 0 68 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 831 1 0 69 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 852 1 0 69 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 876 1 0 68 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 897 1 0 69 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 931 1 0 68 0 - 368 pipe_w ? 00:00:00 automount
040 S 0 933 1 0 69 0 - 368 pipe_w ? 00:00:00 automount
140 S 0 986 1 0 69 0 - 535 do_sel ? 00:00:00 xinetd
140 S 0 1008 1 0 68 0 - 366 do_sel ? 00:00:00 lpd
140 S 0 1026 1 0 69 0 - 347 wait_f ? 00:00:27 rwhod
040 S 0 1043 1026 0 69 0 - 348 nanosl ? 00:00:00 rwhod
040 S 0 1501 1 0 68 0 - 384 nanosl ? 00:00:00 crond
040 S 1 1604 1 0 69 0 - 351 nanosl ? 00:00:00 atd
140 S 0 1625 1 0 69 0 - 586 do_sel ? 00:00:00 gated
100 S 0 1685 1 0 69 0 - 559 do_sel ? 00:00:00 master
100 S 1026 1695 1685 0 69 0 - 582 do_sel ? 00:00:00 qmgr
140 S 0 1696 1 0 69 0 - 635 do_sel ? 00:00:09 sshd
100 S 0 1721 1 0 69 0 - 336 read_c tty1 00:00:00 mingetty
100 S 0 1722 1 0 69 0 - 336 read_c tty2 00:00:00 mingetty
100 S 0 1723 1 0 69 0 - 336 read_c tty3 00:00:00 mingetty
100 S 0 1724 1 0 69 0 - 336 read_c tty4 00:00:00 mingetty
100 S 0 1725 1 0 69 0 - 336 read_c tty5 00:00:00 mingetty
100 S 0 1726 1 0 69 0 - 336 read_c tty6 00:00:00 mingetty
140 S 0 1730 1 0 69 0 - 805 do_sel ? 00:00:00 xdm
040 S 0 10552 1 0 69 0 - 0 rpciod ? 00:00:06 rpciod
040 S 0 10553 1 0 69 0 - 0 svc_re ? 00:00:00 lockd
140 S 0 10743 1696 0 69 0 - 1320 unix_s ? 00:00:00 sshd
140 S 1915 10745 10743 0 69 0 - 1324 do_sel ? 00:00:01 sshd
000 S 1915 10746 10745 0 69 0 - 852 rt_sig pts/0 00:00:00 tcsh
000 S 1915 10901 10746 0 69 0 - 566 wait4 pts/0 00:00:00 bash
000 S 1915 10906 10901 0 69 0 - 841 do_sel pts/0 00:00:00 mc
000 S 1915 10908 10906 0 69 0 - 683 rt_sig pts/2 00:00:00 tcsh
000 S 1915 10948 10908 0 74 0 - 552 wait4 pts/2 00:00:00 bash
100 S 1915 10950 10948 0 73 5 - 1013 do_sel pts/2 00:00:00 boinc
000 S 1915 10952 10950 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6
040 S 1915 10953 10952 0 79 19 - 5985 do_pol pts/2 00:00:00 sulphur_4.22_i6
040 S 1915 10954 10953 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6
040 S 1915 10955 10953 0 79 19 - 5985 nanosl pts/2 00:00:00 sulphur_4.22_i6
000 R 1915 10956 10954 98 79 19 - 31574 - pts/2 15:42:07 sulphur_um_4.22
040 S 1915 10957 10956 0 79 19 - 31574 do_pol pts/2 00:00:00 sulphur_um_4.22
040 S 1915 10958 10957 0 79 19 - 31574 nanosl pts/2 00:00:00 sulphur_um_4.22
100 S 0 12312 1730 0 69 0 - 20160 do_sel ? 00:00:00 X
040 S 0 12313 1730 0 68 0 - 864 do_sel ? 00:00:00 xdm
100 S 1026 13355 1685 0 69 0 - 568 do_sel ? 00:00:00 pickup
100 S 59997 13479 986 0 69 0 - 379 wait_f ? 00:00:00 spingd
000 R 1915 13483 10948 0 77 0 - 846 - pts/2 00:00:00 ps


As you can see, on SHAPE, the application does not consume any CPU time and seems to be deadlocked. After 180 seconds, BOINC kills it and reports a failure to start the model:

Model timeout at 180.00 seconds
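
The comparison of the two snapshots can also be done mechanically by pulling out the sulphur processes and their accumulated CPU time. The Python sketch below is only an illustration: find_processes and cpu_seconds are made-up names, and it assumes the 14-column "ps -Al" layout shown above with a TIME field of the form HH:MM:SS.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME field of the form HH:MM:SS into seconds."""
    hours, minutes, seconds = (int(part) for part in time_field.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def find_processes(ps_output, prefix):
    """Return (pid, state, cpu_seconds, command) for commands starting with prefix."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed columns: F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
        if len(fields) < 14 or fields[0] == "F":
            continue  # skip the header line and anything that is not a process row
        command = " ".join(fields[13:])
        if command.startswith(prefix):
            found.append((int(fields[3]), fields[1], cpu_seconds(fields[12]), command))
    return found
```

Run on both listings with the prefix "sulphur_um", this would report zero accumulated CPU seconds for the process on SHAPE and many hours for the one on SHADING.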

On SHADING, the application is running fine: model startup takes only a few seconds. At the moment it's been running for about 15 hours. I let it run overnight just for testing, since the computer is idle over the holidays, so I am not bothering anyone by running a job on it.

This seems to be a real mystery.

Cheers,
Stefan.

old_user21637
Message 19016 - Posted: 5 Jan 2006, 0:42:02 UTC - in response to Message 19015.  


Oh, Geophi, one more thing.

On my previous listing, SHADING (the working computer) was running SULPHUR 4.22 instead of SULPHUR 4.23 (this was because I made the first experiments before the app had been updated).

However, I have just checked that everything behaves exactly the same when running model 4.23, so the bug is still in 4.23 as it was in 4.22.

Stefan.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19018 - Posted: 5 Jan 2006, 0:59:06 UTC

I notice that MATLAB is running on the computer that's not working. Coincidence?

old_user21637
Message 19019 - Posted: 5 Jan 2006, 1:05:50 UTC - in response to Message 19018.  


Nice observation. Yes, indeed, one of my colleagues is running a MATLAB job on one of the computers.

Unfortunately, I checked on the other computers. There are computers (including my own) which are not running MATLAB at the moment, and SULPHUR still doesn't work on them.

Do you think the cause could be some process running in the background?

Cheers,
Stefan.

P.S. Unfortunately, in the case of SHAPE I cannot stop the other process from running (since it is not my own :) ).

Les Bayliss
Volunteer moderator
Message 19021 - Posted: 5 Jan 2006, 1:20:02 UTC

The little Linux I know doesn't go down to this level, but from general computer use, it's possibly a background process.
But cpdn apps, especially sulphur, are complex beasts, and don't like other programs getting in the way while they're running.
I'm not looking forward to the start of the coupled ocean model. There's going to be a lot of tears and wailing, I fear. :(

geophi
Volunteer moderator
Message 19023 - Posted: 5 Jan 2006, 1:41:53 UTC - in response to Message 19019.  

Do you think the cause could be some process running in the background?

For your general preferences for these computers, do you have "Do work while computer in use" set to yes?

old_user21637
Message 19025 - Posted: 5 Jan 2006, 2:02:50 UTC - in response to Message 19023.  


Yes, I've just checked the preferences, and it is set to "yes" (I also checked the XML file: the <run_if_user_active/> tag is there).

But I don't think that could have been the problem anyway, since, as far as I understand, that option tells BOINC not to start the sulphur application at all if the user is active (or to unload it from memory when he/she becomes active).

In my case, BOINC is TRYING TO START sulphur (but fails to do so, resulting in an unrecoverable error).

By the way, if you let BOINC run, after a number of useless attempts to start sulphur, BOINC gives up and reports failure on the data set (error -161, since no files were created by SULPHUR):

Starting model ID sulphur_ihfj_000862399 Phase 1
Waiting for model startup, this may take a minute...
Model timeout at 180.00 seconds
Preparing for restart...
Rewinding a model-day...
Starting model ID sulphur_ihfj_000862399 Phase 1
Waiting for model startup, this may take a minute...
Model timeout at 180.00 seconds

...and so on, and then:

2006-01-04 19:53:58 [climateprediction.net] Unrecoverable error for result sulphur_ihfj_000862399_0 (<file_xfer_error>
<file_name>sulphur_ihfj_000862399_0_1.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_ihfj_000862399_0_2.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_ihfj_000862399_0_3.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_ihfj_000862399_0_4.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
<file_xfer_error>
<file_name>sulphur_ihfj_000862399_0_5.zip</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
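
For longer logs, the failed uploads can be summarised by extracting each file name and error code from the <file_xfer_error> blocks. The following Python sketch is an assumption-laden illustration: transfer_errors is a hypothetical helper, and it uses a regular expression rather than an XML parser because the fragments are embedded in a plain log line and do not form a single well-formed document.

```python
import re

# Assumption: each block lists <file_name> first and <error_code> second,
# as in the log excerpt above.
BLOCK_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]*)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def transfer_errors(log_text):
    """Return a list of (file_name, error_code) pairs found in the log text."""
    return [
        (match.group("name"), int(match.group("code")))
        for match in BLOCK_RE.finditer(log_text)
    ]
```

Applied to the excerpt above, this would yield one pair per .zip file, each carrying the code -161.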

Regards,
Stefan.

old_user21637
Message 19027 - Posted: 5 Jan 2006, 5:07:55 UTC - in response to Message 19025.  


I have some more information relating to the bug, obtained by running strace.

It seems that sulphur_um_4.22 dies with a SEGMENTATION FAULT on SHAPE! I have also found the point where it dies in the strace output (see the log below).

[...]

getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82
open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005208", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
read(13, " 1 "..., 32768) = 122
close(13) = 0
getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82
open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005222", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
read(13, " 1 "..., 32768) = 122
close(13) = 0
getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82
open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01005223", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
read(13, " 1 "..., 32768) = 122
close(13) = 0
getcwd("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 82
open("/h/10/mstefan/boinc.glenora/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/stasets/X01010206", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(13, {st_mode=S_IFREG|0644, st_size=122, ...}) = 0
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbffa6b20) = -1 ENOTTY (Inappropriate ioctl for device)
read(13, " 4 "..., 32768) = 122
close(13) = 0
write(8, " UMSETUP; CRUN, read CONTCNTL f"..., 32564) = 32564
write(8, " 1 1 72 1"..., 32553) = 32553
--- SIGSEGV (Segmentation fault) ---



On SHADING, however, there is no segmentation fault at this point! The next lines in the STRACE would then be:


getcwd("/h/10/mstefan/boinc.glenora/BOINC/projects/climateprediction.net/sulphur_i4cy_000845458", 4095) = 88
open("/h/10/mstefan/boinc.glenora/BOINC/projects/climateprediction.net/sulphur_i4cy_000845458/datain/ancil/ctldata/STASHmaster/STASHmaster_A", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 13
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbee85d44) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(13, {st_mode=S_IFREG|0644, st_size=322516, ...}) = 0
ioctl(13, SNDCTL_TMR_TIMEBASE, 0xbee85d44) = -1 ENOTTY (Inappropriate ioctl for device)
read(13, "H1| SUBMODEL_NUMBER=1\nH2| SUBMOD"..., 32768) = 32768
read(13, "110 |CLOUD SOOT MASS MIX RAT AFT"..., 32750) = 32750
_llseek(13, 0, [65518], SEEK_CUR) = 0
_llseek(13, -65475, [43], SEEK_CUR) = 0
read(13, "OS\nH3| UM_VERSION=4.5\n#\n#|Model "..., 32767) = 32767
read(13, "\n2| 2 | 0 | 1 | 1 | "..., 32708) = 32708
read(13, " 1 | 2 | 205 |OUTGOING LW R"..., 32764) = 32764
_llseek(13, 0, [98282], SEEK_CUR) = 0
_llseek(13, -65475, [32807], SEEK_CUR) = 0
read(13, "P |\n2| 2 | 0 | 1 | 1"..., 32767) = 32767
read(13, "\n2| 0 | 0 | 2 | 1 | "..., 32708) = 32708
_llseek(13, 0, [98282], SEEK_CUR) = 0

[...and so on...]


Notes:

* The trace was done using sulphur_4.22 because I have just lost my 4.23 data set (through repeated failure testing on different machines) and the server is down, so I can't get a new data set for it. However, a similar trace exists for 4.23 (and as soon as the project is up again I shall run a trace on that one).
* The command for the trace was:

strace -o trace -ff boinc

* The log fragments above correspond to the trace for the process sulphur_um_4.22.
* The function call failures (Inappropriate ioctl for device) appear both in the log for SHAPE and SHADING, thus seem to be a normal thing.

Hence, our problem can now be restated as follows:

On some Linux machines, sulphur_um_4.22 crashes with a segmentation fault. The point where it does so is just before opening the file datain/ancil/ctldata/STASHmaster/STASHmaster_A in the data set.
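
Locating the crash point in a long trace by eye is tedious, so the search for "the last calls before the signal" can be scripted. This is a minimal Python sketch under stated assumptions: calls_before_signal is a hypothetical helper, and it relies on strace reporting a delivered signal on a line that starts with "---", as in the excerpt above.

```python
from collections import deque


def calls_before_signal(trace_lines, signal_name="SIGSEGV", context=3):
    """Return the last `context` non-empty lines seen before the first line
    that reports the given signal, or None if the signal never appears."""
    recent = deque(maxlen=context)
    for line in trace_lines:
        line = line.rstrip("\n")
        # Assumption: signal delivery lines look like "--- SIGSEGV (...) ---".
        if line.startswith("---") and signal_name in line:
            return list(recent)
        if line.strip():
            recent.append(line)
    return None
```

Because "strace -o trace -ff boinc" writes one output file per traced process, the function would be applied to each of those files in turn, passing an open file object as trace_lines.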

I hope this helps the authors find the bug.

Cheers,
Stefan.

old_user21637
Message 19030 - Posted: 5 Jan 2006, 7:03:05 UTC - in response to Message 19027.  

Latest news: Found a difference between SHADING and all the other computers!

The following lines from the strace log file show what libraries sulphur is binding to:

open("/lib/i686/mmx/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686/mmx", 0xbfffefc0) = -1 ENOENT (No such file or directory)
open("/lib/i686/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686", 0xbfffefc0) = -1 ENOENT (No such file or directory)
open("/lib/mmx/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/mmx", 0xbfffefc0) = -1 ENOENT (No such file or directory)
open("/lib/libm.so.6", O_RDONLY) = 6

As you can see, it is looking for the file "libm.so.6". There are also similar searches for "libc.so.6" and "libpthread.so.0".
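
The list of libraries a process actually ended up loading can be pulled out of the trace by keeping only the open() calls that succeeded. The Python sketch below is illustrative only: loaded_libraries is an invented name, and the pattern assumes the open("path", flags) = fd line format shown above (it also tolerates a backslash before each quote, in case the log was pasted with escaped quotes).

```python
import re

# A successful open() returns a non-negative descriptor; failures show "= -1 ENOENT ...".
# Assumption: shared libraries are recognisable by ".so" in the file name.
OPEN_RE = re.compile(
    r'^open\(\\?"(?P<path>[^"\\]+\.so[^"\\/]*)\\?", [^)]*\)\s*=\s*(?P<fd>\d+)'
)


def loaded_libraries(trace_text):
    """Return the paths of shared libraries whose open() succeeded, in first-seen order."""
    paths = []
    for line in trace_text.splitlines():
        match = OPEN_RE.match(line.strip())
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

Comparing the returned lists from a working and a failing machine shows which files were resolved, though not their versions, which still have to be checked on each host.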

Now, I looked for the versions of these files and found the following:

* All stations except SHADING are using:

libm version 2.3.2
libc version 2.3.2
pthread version 0.10

* SHADING is using

libm version 2.2.5
libc version 2.2.5
pthread version 0.9

So, the hypothesis would be:

Maybe SULPHUR does not work well (gives segmentation fault) with the library version combination: LIBM 2.3.2, LIBC 2.3.2, LIBPTHREAD 0.10. On the other hand, it works well with (older) combination: LIBM 2.2.5, LIBC 2.2.5, LIBPTHREAD 0.9.

It's just a hypothesis, but up to this point it's the only difference I could find.
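
A hypothesis like this can be checked a little more systematically by recording the library versions per host and looking for libraries whose versions never overlap between the working and the failing group. The Python sketch below is only an illustration: differing_versions and local_libc are made-up helper names, the host tables have to be filled in by hand, and a clean split between the groups shows a correlation, not a proven cause.

```python
import platform


def local_libc():
    """Return the (name, version) tuple Python reports for its own C library."""
    return platform.libc_ver()


def differing_versions(working, failing):
    """Return {library: (versions_on_working, versions_on_failing)} for libraries
    whose version sets do not overlap between the two groups of hosts.

    Both arguments map a host name to a {library name: version string} dict.
    """
    libraries = set()
    for host_libs in list(working.values()) + list(failing.values()):
        libraries.update(host_libs)

    result = {}
    for lib in sorted(libraries):
        ok = {libs[lib] for libs in working.values() if lib in libs}
        bad = {libs[lib] for libs in failing.values() if lib in libs}
        if ok and bad and ok.isdisjoint(bad):
            result[lib] = (sorted(ok), sorted(bad))
    return result
```

With the versions listed above entered for SHADING on one side and the failing machines on the other, all three libraries would be reported as differing.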

Cheers again,
Stefan.

old_user21637
Message 19031 - Posted: 5 Jan 2006, 7:15:16 UTC - in response to Message 19030.  

Hypothesis confirmed: found another computer that works. MARS also has versions LIBC 2.2.5, LIBM 2.2.5, PTHREAD 0.9!

Hence, I believe this is it:

***************************
SULPHUR_UM does not work (gives segmentation fault) with the new library versions LIBM 2.3.2, LIBC 2.3.2 and PTHREAD 0.10.
***************************

I promise to stop flooding you with messages now that it seems we've found the answer, cross my heart!

Cheers,
Stefan.

Les Bayliss
Volunteer moderator
Message 19032 - Posted: 5 Jan 2006, 7:28:25 UTC

Great piece of detective work, Stefan.
Now to see what Tolu thinks.

Arnaud
Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 19033 - Posted: 5 Jan 2006, 7:38:24 UTC
Last modified: 5 Jan 2006, 7:42:42 UTC

Perhaps the new sulphur apps were compiled on a too "up-to-date" machine, containing all the latest libraries.
AFAIK, Carl had this problem in the Beta Sulphur (or was it Spinup, I don't remember well): compiling on Sarge produced error messages, and reverting to a previous version of Debian solved some problems. But I don't know if that is the case now.
Arnaud

old_user21637
Message 19036 - Posted: 5 Jan 2006, 10:01:16 UTC - in response to Message 19033.  

Perhaps the new sulphur apps were compiled on a too "up-to-date" machine, containing all the latest libraries.
AFAIK, Carl had this problem in the Beta Sulphur (or was it Spinup, I don't remember well): compiling on Sarge produced error messages, and reverting to a previous version of Debian solved some problems. But I don't know if that is the case now.


Yes, but in this case it seems to be the other way around: sulphur works well with the old libraries, but does not work well with the new ones. Would this mean that the new versions of the libraries are not entirely backward compatible? Or that sulphur somewhere assumes something about the behaviour of some library functions that is not in the specification and thus is not maintained in the new version?

Well, only the authors of the app. will be able to tell us, I guess...
