climateprediction.net (CPDN) home page
Thread 'Multiple processes on singlecore'

Kenneth Larsen
Joined: 26 Aug 04
Posts: 59
Credit: 438,133
RAC: 0
Message 35301 - Posted: 18 Oct 2008, 12:11:58 UTC

I have this Hadsm model running on my Linux computer, using version 6.0 of the climate app and BOINC v5.10.45.
Lately, at about 25%, the model has started spawning several instances of the app, all sharing CPU time and memory. Right now I have 8 processes running, and when I try to restart BOINC they start coming back little by little. Is this normal? The model seems to have slowed to a crawl because of this.

Until I get this solved, I'll leave the model suspended.

Regards,
Kenneth
ID: 35301

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 35313 - Posted: 18 Oct 2008, 20:34:57 UTC
Last modified: 18 Oct 2008, 20:35:33 UTC

That's a hadsm3mh model, Kenneth. They have 4 phases and start post-processing the first phase at 25%, at which point it'll start spawning multiple instances of the hadsm3mh_se_6.00_* program (but only one should be running at any time). Having said that, the task has already returned 3 trickles from phase 2, so I'd expect it to be somewhere between 28 and 29% and only running a single instance each of hadsm3mh_6.00_* and hadsm3mh_um_6.00_*.

What are the exact process names that are running?

Also check the projects/climateprediction.net/hadsm3mh_kk0u_005999698/dataout directory. If phase 1 post-processing has completed, all of the kk0uaa.* files should end with .x1.nc.
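For anyone who would rather script this check than eyeball the directory listing, here is a minimal Python sketch. The function name `files_missing_suffix` is made up for illustration; the `kk0uaa.` prefix and `.x1.nc` suffix are taken from the post above, and the directory path is whatever your own task uses.

```python
from pathlib import Path


def files_missing_suffix(dataout_dir, prefix="kk0uaa.", suffix=".x1.nc"):
    """Return sorted names in dataout_dir that start with prefix but lack suffix.

    An empty list means every matching file carries the expected suffix.
    """
    return sorted(
        p.name
        for p in Path(dataout_dir).iterdir()
        if p.is_file()
        and p.name.startswith(prefix)
        and not p.name.endswith(suffix)
    )
```

Calling `files_missing_suffix("projects/climateprediction.net/hadsm3mh_kk0u_005999698/dataout")` from the BOINC directory should then list any phase 1 files that have not been post-processed yet.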

Then open up the file projects/climateprediction.net/hadsm3mh_kk0u_005999698.xml and check the values of the <PH>, <TS> and <TR> tags plus all the tags after <TR> and before the <RSD attr="0"> tag (sorry I can't be more precise, but I don't have any hadsm3mh tasks running).
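If you prefer to pull those values out with a script, the following is a rough Python sketch. It assumes each value is stored as simple `<TAG>value</TAG>` text, which is a guess about the file layout rather than something confirmed in this thread, and `read_tags` is a hypothetical helper name. It uses a regular expression instead of an XML parser because the file may not be strictly well-formed XML.

```python
import re


def read_tags(text, names=("PH", "TS", "TR")):
    """Return {tag: value} for the first <TAG>value</TAG> occurrence of each name.

    Assumption: values are plain text between matching open and close tags.
    Tags that are not found are simply left out of the result.
    """
    found = {}
    for name in names:
        tag = re.escape(name)
        match = re.search(rf"<{tag}>\s*([^<]*?)\s*</{tag}>", text)
        if match:
            found[name] = match.group(1)
    return found
```

Read the state file into a string first (for example with `open(path).read()`) and pass it in; extra tag names such as `"ST"` or `"RS"` can be supplied through `names`.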
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 35313

Kenneth Larsen
Joined: 26 Aug 04
Posts: 59
Credit: 438,133
RAC: 0
Message 35315 - Posted: 18 Oct 2008, 21:08:52 UTC
Last modified: 18 Oct 2008, 21:10:36 UTC

Hello Thyme,

Not all files in projects/climateprediction.net/hadsm3mh_kk0u_005999698/dataout end with .x1.nc; there are still a few without an ending (pa26* ... pb.27* and similar).

The .xml has the following in the parts you mentioned:
PH: 2
TS: 38737
TR: 32406
and then
ST: 1
RS: 3
RSC: 0
RSDT: 38632
RSMT: 37296
RSYT: 34416

All the parallel processes are called hadsm3mh_um_6.0 and, as I say, they all share the CPU equally. The odd thing is, if I suspend the model, one of the processes continues and has to be killed manually. When I resume the model, it starts with just one process, but within some hours it begins spawning more, up to about 7 or 8.
The model is right now at 28.751%.
As you can also see from the latest trickles, the crunching has slowed down considerably in the last 2 days; this is the only project currently running on the machine, which is on 24 hours a day and hasn't been used much by me.

I really don't want to have to kill the model and appreciate your help.
Regards,
Kenneth
ID: 35315

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 35317 - Posted: 18 Oct 2008, 22:45:45 UTC
Last modified: 18 Oct 2008, 22:49:49 UTC

As long as the files in dataout which don't end with .x1.nc all start with kk0uba, the files are in the correct state for your current timestep.

Having more than one instance of the hadsm3mh_um_6.0 worker program running normally indicates that the hadsm3mh_6.0 controller process has terminated without killing the worker it spawned (with v5 applications that would have resulted in the task being reported as failed with exit status 1 when the second um process started). As you're running Linux, that's easy enough to check: all the orphaned um processes should have an inherited ppid of 1.
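The ppid check can also be scripted. Below is a minimal Python sketch that reads `/proc/<pid>/stat` directly; `parse_stat` and `find_orphans` are illustrative names, not part of BOINC or the CPDN apps. It matches on the name prefix `hadsm3mh_um` because the kernel truncates the command name in `stat` to 15 characters, which is likely why the processes show up as hadsm3mh_um_6.0.

```python
import os


def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the LAST ')' in the line.
    left, right = stat_line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    # After comm: fields[0] is the process state, fields[1] is the parent pid.
    return int(pid_str), comm, int(fields[1])


def find_orphans(name_prefix="hadsm3mh_um", proc_root="/proc"):
    """Return sorted (pid, comm) pairs for matching processes whose ppid is 1."""
    orphans = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, ppid = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the line was unreadable
        if comm.startswith(name_prefix) and ppid == 1:
            orphans.append((pid, comm))
    return sorted(orphans)
```

Running `find_orphans()` while the extra workers are present should list them if the controller really has died. On newer systems an orphan may be re-parented to a subreaper process rather than to pid 1, in which case this check would miss it.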

Check your BOINC directory for anything strange in stdoutdae.txt and stderrdae.txt. It's also worth having a look at slots/<n>/stderr.txt and projects/climateprediction.net/hadsm3mh_kk0u_005999698/std*
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 35317

Kenneth Larsen
Joined: 26 Aug 04
Posts: 59
Credit: 438,133
RAC: 0
Message 35320 - Posted: 19 Oct 2008, 9:13:12 UTC

Well, it seems to have solved itself for now; it has been running during the night and this morning only one process was still running. I'll keep monitoring it and report here if it happens again.

Thanks for the help anyway, Thyme.
ID: 35320

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35321 - Posted: 19 Oct 2008, 10:07:42 UTC

Look at the model's graphics every couple of days. If you can see all the colours for each display mode, e.g. red for hot areas and blue for cold, this will probably indicate healthy processing. But if the displays are monochrome, e.g. the entire globe is blue, something will be wrong.
Cpdn news
ID: 35321

Kenneth Larsen
Joined: 26 Aug 04
Posts: 59
Credit: 438,133
RAC: 0
Message 35323 - Posted: 19 Oct 2008, 11:52:52 UTC

Unfortunately this isn't possible for me: most of my machines are just a mainboard, CPU and memory, and I use ssh to control them. Even on the one that has a monitor, the "show graphics" button is greyed out after having done the xhost +local: thing in the console.
However, it seems to be running fine now.
ID: 35323

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35326 - Posted: 19 Oct 2008, 16:12:22 UTC
Last modified: 19 Oct 2008, 16:15:56 UTC

In that case, check the model's trickles and sec/timestep from time to time on its web page to make sure the processing isn't getting much faster or slower. The figures shown there are cumulative averages. And after the upload at the end of phase 2, check its temp and precipitation graphs to make sure all the phase 2 data has been processed. If slab models become abnormal there are usually gaps in one graph or both. (Your phase 1 graphs are good.)

At least one other member is further ahead with this model than you, so there probably isn't a defect in the model even though it's behaved badly on your computer.
Cpdn news
ID: 35326

Kenneth Larsen
Joined: 26 Aug 04
Posts: 59
Credit: 438,133
RAC: 0
Message 35328 - Posted: 19 Oct 2008, 17:23:27 UTC

It seems like I was a bit too fast in saying that the model was progressing normally again: now, after about 24 hours of normal crunching it is starting to act up again. It spawned 4 more hadsm3mh_um_6.0 processes and I had to stop BOINC and kill the remaining processes. If I kill one, the rest disappear too.

I'll let it continue, hoping we can learn something more from it.
ID: 35328

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 35330 - Posted: 19 Oct 2008, 18:20:34 UTC

Check your personal messages Kenneth.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 35330