climateprediction.net (CPDN) home page
Thread 'Stability Problems on SMP Linux?'

old_user6656

Joined: 31 Aug 04
Posts: 2
Credit: 171,502
RAC: 0
Message 2592 - Posted: 1 Sep 2004, 22:07:01 UTC

It seems that CP / BOINC (4.05) is somewhat unstable on an SMP kernel (P4 with Hyper-Threading; SuSE 9.1, kernel 2.6.5-7.104-smp). The problem does not occur on uniprocessor machines.
I have already tried detaching and reattaching the machine to get a fresh copy of the CP clients, but that did not change anything. The downloaded files are identical to the ones on my uniprocessor machines:

hadsm3_4.03_i686-pc-linux-gnu
hadsm3se_4.03_i686-pc-linux-gnu
hadsm3um_4.03_i686-pc-linux-gnu
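
For anyone who wants to check the "identical files" claim by checksum rather than by eye, here is a minimal sketch. It assumes the three executables from the working machine have been copied into a second directory; the function names and the choice of SHA-256 are illustrative and are not part of BOINC or CPDN.

```python
import hashlib
from pathlib import Path

# File names taken from the list above.
MODEL_FILES = [
    "hadsm3_4.03_i686-pc-linux-gnu",
    "hadsm3se_4.03_i686-pc-linux-gnu",
    "hadsm3um_4.03_i686-pc-linux-gnu",
]


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_dirs(dir_a, dir_b, names=MODEL_FILES):
    """Map each file name to 'identical', 'DIFFERENT' or 'missing'."""
    result = {}
    for name in names:
        path_a, path_b = Path(dir_a) / name, Path(dir_b) / name
        if not (path_a.is_file() and path_b.is_file()):
            result[name] = "missing"
        elif sha256_of(path_a) == sha256_of(path_b):
            result[name] = "identical"
        else:
            result[name] = "DIFFERENT"
    return result
```

Calling compare_dirs() with the project directory of the SMP box and a copy of the one from a uniprocessor box (both paths are up to you) returns one status per file. If all three come back "identical", the binaries themselves are probably not the cause.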

The log only shows the following messages, alternating:
....
Model timeout at 180.00 seconds
Model crashed...retrying...restart level 0
Preparing for restart...
Rewinding a model-day...
Starting model ID 05x4_000032685 Phase 1
Stack size=4096.00 MB
Waiting for model startup, this may take a minute...
05x4_000032685 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
Model timeout at 180.00 seconds
Model crashed...retrying...restart level 1
Preparing for restart...
Rewinding a model-month...
Error: Restart files for dataout/restart.month not found
Giving up, this result exceeded crash count for available restart files.
adding: ncatts.cpdc (deflated 72%)
adding: climate.cont (deflated 79%)
adding: climate.cpdc (deflated 79%)
adding: climate.doub (deflated 79%)
adding: climate.spin (deflated 79%)
adding: 05x4_000032685.xml (deflated 65%)
adding: ncatts.cpdc (deflated 72%)
adding: ncatts.cpdc (deflated 72%)
adding: ncatts.cpdc (deflated 72%)
adding: stderr_um.txt (deflated 75%)
adding: yabsd.out (deflated 93%)
adding: restart.day (deflated 43%)
2004-09-02 00:05:09 [climateprediction.net] Unrecoverable error for result 05x4_000032685_0 (process exited with code 251 (0xfb))
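
To see at a glance how far a run like this got before giving up, the log can be summarised with a few regular expressions. This is only a sketch: the patterns are written against the message wording quoted above, and the function name is made up for illustration.

```python
import re

# Patterns match the wording of the log lines quoted in this post.
TIMEOUT_RE = re.compile(r"Model timeout at ([\d.]+) seconds")
RESTART_RE = re.compile(r"Model crashed\.\.\.retrying\.\.\.restart level (\d+)")
EXIT_RE = re.compile(r"process exited with code (\d+) \(0x([0-9a-fA-F]+)\)")


def summarize(log_text):
    """Count timeouts, find the highest restart level and the exit code."""
    timeouts = TIMEOUT_RE.findall(log_text)
    levels = [int(level) for level in RESTART_RE.findall(log_text)]
    exit_match = EXIT_RE.search(log_text)
    return {
        "timeouts": len(timeouts),
        "max_restart_level": max(levels, default=None),
        "exit_code": int(exit_match.group(1)) if exit_match else None,
    }


SAMPLE = """\
Model timeout at 180.00 seconds
Model crashed...retrying...restart level 0
Model timeout at 180.00 seconds
Model crashed...retrying...restart level 1
2004-09-02 00:05:09 [climateprediction.net] Unrecoverable error for result 05x4_000032685_0 (process exited with code 251 (0xfb))
"""

print(summarize(SAMPLE))
```

On the excerpt above this reports two timeouts, a highest restart level of 1 and exit code 251, which matches the 0xfb shown in the last log line.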


top shows that one of the processes is defunct (a zombie, state Z):
26381 distrib 34 19 3480 1512 2776 S 0.0 0.3 0:00.24 hadsm3_4.03_i68
26543 distrib 34 19 0 0 0 Z 0.0 0.0 0:00.31 hadsm3um_4.03_i
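
Zombies like the one above can also be listed without top by reading /proc directly. The following is a rough sketch that assumes a Linux /proc filesystem; the function names are hypothetical.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process currently in state Z."""
    zombies = []
    if not os.path.isdir(proc_root):
        return zombies  # not a Linux-style system
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away or the line was unreadable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies
```

The parent PID is the interesting part: a zombie stays in the process table until its parent collects its exit status, so the ppid shows which process has not yet reaped the dead model process.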

ProfileastroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 2623 - Posted: 2 Sep 2004, 2:24:12 UTC

Hi, Frank,

FWIW, I tried SuSE 9.1 Pro and it failed to recognize the second half of the HT CPU. Makes it useless, eh? (Tried on two similar boxes.) I then downgraded to SuSE 9.0 Personal, which runs on three machines. (There were also backup failures, where the backups lost files, including ALL email, bookmarks, and the address book.) I'm not surprised by any failure of 9.1.

Both boxes are P4 3.0 on ASUS P4P800 MB.

I hope you are not also being bitten by SuSE 9.1!


________________________________________________
Indeed I tremble for my country when I reflect that God is just.
-- Thomas Jefferson
old_user6656

Joined: 31 Aug 04
Posts: 2
Credit: 171,502
RAC: 0
Message 2889 - Posted: 3 Sep 2004, 16:15:29 UTC - in response to Message 2623.  


Hi

> I hope you are not also being bitten by SuSE 9.1!

I hope not, and yes, the kernel version provided on the DVD did not work for SMP on a P4 with HT. The problems ranged from the second CPU being found but not used up to the system freezing. Since then, however, some updates have been installed....

And I do not consider this to be the problem. No other program I use has any problem with the HT kernel; SETI on BOINC in particular runs fine. The kernel itself is working and reports the "two" CPUs properly.
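
One way to confirm what the kernel reports is to count the "processor" entries in /proc/cpuinfo. A minimal sketch, assuming the x86 Linux layout of that file (one "processor : N" line per logical CPU); the function name is illustrative.

```python
import re


def count_logical_cpus(cpuinfo_text):
    """Count the 'processor : N' entries in the text of /proc/cpuinfo."""
    return len(re.findall(r"^processor\s*:", cpuinfo_text, flags=re.MULTILINE))


def logical_cpus(path="/proc/cpuinfo"):
    """Read the given file and count the logical CPUs; None if unreadable."""
    try:
        with open(path) as fh:
            return count_logical_cpus(fh.read())
    except OSError:
        return None
```

On a single P4 with Hyper-Threading enabled and an SMP kernel, logical_cpus() would be expected to return 2.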
ProfileThyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 2902 - Posted: 3 Sep 2004, 17:12:30 UTC

Hi Frank,

Your problems might be related to a Visual Fortran error that's been afflicting the Windows build recently. Seems that some workunits have gone out with a duff file.

Check out this thread on the phpBB forum: http://www.climateprediction.net/board/viewtopic.php?t=2296&p=20006#20006

And thanks to sjokela for doing the investigative work and UK_Nick for providing a link to the file that gives a workaround for the problem :)
