Questions and Answers : Unix/Linux : Stability Problems on SMP Linux?
Author | Message |
---|---|
Joined: 31 Aug 04 Posts: 2 Credit: 171,502 RAC: 0 |
It seems that CP / BOINC (4.05) is somewhat unstable on an SMP kernel (P4 Hyper-Threading; SuSE 9.1, 2.6.5-7.104-smp). The problem does not exist on uniprocessor machines. I already tried to detach / reattach the machine to get a fresh version of the CP clients, but that did not change anything. The downloaded files are identical to the ones on my uniprocessor machines: hadsm3_4.03_i686-pc-linux-gnu, hadsm3se_4.03_i686-pc-linux-gnu, hadsm3um_4.03_i686-pc-linux-gnu. The log only says this (alternating): .... Model timeout at 180.00 seconds Model crashed...retrying...restart level 0 Preparing for restart... Rewinding a model-day... Starting model ID 05x4_000032685 Phase 1 Stack size=4096.00 MB Waiting for model startup, this may take a minute... 05x4_000032685 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00 Model timeout at 180.00 seconds Model crashed...retrying...restart level 1 Preparing for restart... Rewinding a model-month... Error: Restart files for dataout/restart.month not found Giving up, this result exceeded crash count for available restart files. adding: ncatts.cpdc (deflated 72%) adding: climate.cont (deflated 79%) adding: climate.cpdc (deflated 79%) adding: climate.doub (deflated 79%) adding: climate.spin (deflated 79%) adding: 05x4_000032685.xml (deflated 65%) adding: ncatts.cpdc (deflated 72%) adding: ncatts.cpdc (deflated 72%) adding: ncatts.cpdc (deflated 72%) adding: stderr_um.txt (deflated 75%) adding: yabsd.out (deflated 93%) adding: restart.day (deflated 43%) 2004-09-02 00:05:09 [climateprediction.net] Unrecoverable error for result 05x4_000032685_0 (process exited with code 251 (0xfb)) Top tells me that one process is defunct (a zombie): 26381 distrib 34 19 3480 1512 2776 S 0.0 0.3 0:00.24 hadsm3_4.03_i68 26543 distrib 34 19 0 0 0 Z 0.0 0.0 0:00.31 hadsm3um_4.03_i |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Hi, Frank, FWIW: I tried SuSE 9.1 Pro and it failed to recognize the second half of the HT CPU. Makes it useless, eh? (Tried on two similar boxes.) Then I downgraded to SuSE 9.0 Personal, which now runs on three machines. (There were also backup failures under 9.1, where the backups lost files, including ALL email, bookmarks, and the address book.) I'm not surprised at any failures of 9.1. Both boxes are P4 3.0 on ASUS P4P800 motherboards. I hope you are not also being bitten by SuSE 9.1! ________________________________________________ Indeed I tremble for my country when I reflect that God is just. -- Thomas Jefferson |
Joined: 31 Aug 04 Posts: 2 Credit: 171,502 RAC: 0 |
Hi, > I hope you are not also being bitten by SuSE 9.1! I hope not to be, and yes, the kernel version provided on the DVD did not work for SMP on P4 HT. Problems ranged from detecting the second CPU but not using it, up to freezing the system. But some updates have been installed since then, and I do not consider this to be the problem. No other program I am using has any problem with the HT kernel; in particular, SETI under BOINC runs fine. The kernel itself is working and reports the "two" CPUs properly. |
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Hi Frank, Your problems <i>might</i> be related to a Visual Fortran error that's been afflicting the Windows build recently. It seems that some workunits have gone out with a duff file. Check out <a href="http://www.climateprediction.net/board/viewtopic.php?t=2296&p=20006#20006">this thread</a> on the phpBB forum. And thanks to <b>sjokela</b> for doing the investigative work and <b>UK_Nick</b> for providing a link to the file that gives a workaround for the problem :) <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/team_display.php?teamid=3">Join us here</a> |
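For anyone seeing the same symptom (a model process shown in state Z in top), here is a minimal sketch for listing defunct processes together with their parent PID, which can help tell which wrapper process failed to reap its child. It assumes a Linux box with the procps `ps` command; the function names `find_zombies` and `list_zombies` are made up for illustration and are not part of BOINC or the CPDN client.

```python
import subprocess


def find_zombies(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` text and return a list of
    (pid, ppid, command) tuples for processes whose state starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 3)  # pid, ppid, stat, rest of line = command
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):  # 'Z' marks a defunct (zombie) process
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies():
    """Run ps on the local machine (assumes procps is installed) and
    return the zombies it reports."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_zombies(result.stdout)


# Usage on a live system:
#   for pid, ppid, comm in list_zombies():
#       print(f"zombie {pid} ({comm}), parent {ppid}")
```

A zombie only disappears once its parent collects the exit status or the parent itself exits, so the parent PID is usually the more useful thing to look at. This does not explain the model timeout itself; it only makes the leftover process easier to spot.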
©2024 cpdn.org