Thread 'Multiple processes on singlecore'

Author	Message
Kenneth Larsen Joined: 26 Aug 04 Posts: 59 Credit: 438,133 RAC: 0	Message 35301 - Posted: 18 Oct 2008, 12:11:58 UTC I have this Hadsm model running on my Linux computer, using version 6.0 of the climate app and BOINC v5.10.45. Lately, at about 25%, the model has started spawning several instances of the app, all sharing CPU time and memory. Right now I have 8 processes running, and when I restart BOINC they start coming back little by little. Is this normal? The model seems to have slowed to a crawl because of this. Until I get this solved, I'll leave the model suspended. Regards, Kenneth ID: 35301

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 35313 - Posted: 18 Oct 2008, 20:34:57 UTC Last modified: 18 Oct 2008, 20:35:33 UTC That's a hadsm3mh model, Kenneth. These models have 4 phases and start post-processing the first phase at 25%, at which point the task starts spawning multiple instances of the hadsm3mh_se_6.00_* program (but only one should be running at any time). Having said that, the task has already returned 3 trickles from phase 2, so I'd expect it to be somewhere between 28 and 29% and only running a single instance each of hadsm3mh_6.00_* and hadsm3mh_um_6.00_. What are the exact process names that are running? Also check the projects/climateprediction.net/hadsm3mh_kk0u_005999698/dataout directory. If phase 1 post-processing has completed, all of the kk0uaa. files should end with .x1.nc. Then open up the file projects/climateprediction.net/hadsm3mh_kk0u_005999698.xml and check the values of the <PH>, <TS> and <TR> tags plus all the tags after <TR> and before the <RSD attr="0"> tag (sorry I can't be more precise, but I don't have any hadsm3mh tasks running). "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer ID: 35313
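
The two checks described in the post above could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the kk0uaa prefix and .x1.nc suffix are taken from the post, and the tag reader assumes the task XML stores values in plain <TAG>value</TAG> form, which the thread does not confirm.

```python
import re


def not_yet_postprocessed(filenames, prefix="kk0uaa", suffix=".x1.nc"):
    """Return phase-1 files that do not (yet) end with the post-processed suffix."""
    return sorted(
        name for name in filenames
        if name.startswith(prefix) and not name.endswith(suffix)
    )


def read_tags(xml_text, tags=("PH", "TS", "TR")):
    """Pull simple <TAG>value</TAG> pairs out of the task XML text.

    Assumption: each tag appears once, on its own, with a plain text value.
    Tags that are not found are simply left out of the result.
    """
    found = {}
    for tag in tags:
        match = re.search(r"<%s>\s*([^<]*?)\s*</%s>" % (tag, tag), xml_text)
        if match:
            found[tag] = match.group(1)
    return found
```

To use it on a real task, pass the result of os.listdir() on the dataout directory to not_yet_postprocessed(), and the contents of the task's .xml file to read_tags(). An empty list from the first call would suggest phase 1 post-processing has finished.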

Kenneth Larsen Joined: 26 Aug 04 Posts: 59 Credit: 438,133 RAC: 0	Message 35315 - Posted: 18 Oct 2008, 21:08:52 UTC Last modified: 18 Oct 2008, 21:10:36 UTC Hello Thyme, Not all files in projects/climateprediction.net/hadsm3mh_kk0u_005999698/dataout end with .x1.nc; there are still a few without that ending (pa26* ... pb.27* and similar). The .xml has the following in the parts you mentioned: PH: 2 TS: 38737 TR: 32406 and then ST: 1 RS: 3 RSC: 0 RSDT: 38632 RSMT: 37296 RSYT: 34416 All the processes running in parallel are called hadsm3mh_um_6.0 and, as I said, they are all sharing the CPU equally. The odd thing is, if I suspend the model, one of the processes continues and has to be killed manually. When I resume the model, it starts with just one process, but within some hours it begins spawning more, up to about 7 or 8. The model is right now at 28.751%. As you can also see from the latest trickles, the crunching has slowed down considerably in the last 2 days; this is the only project currently running on the machine, which is on 24 hours a day and hasn't been used much by me. I really don't want to have to kill the model and appreciate your help. Regards, Kenneth ID: 35315

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 35317 - Posted: 18 Oct 2008, 22:45:45 UTC Last modified: 18 Oct 2008, 22:49:49 UTC As long as the files in dataout which don't end with .x1.nc all start with kk0uba, the files are in the correct state for your current timestep. Having more than one instance of the hadsm3mh_um_6.0 worker program running normally indicates that the hadsm3mh_6.0 controller process has terminated without killing the worker it spawned (with v5 applications that would have resulted in the task being reported as failed with exit status 1 when the second um process started). As you're running Linux, that's easy enough to check; all the orphaned um processes should have an inherited ppid of 1. Check your BOINC directory for anything strange in stdoutdae.txt and stderrdae.txt. It's also worth having a look at slots/<n>/stderr.txt and projects/climateprediction.net/hadsm3mh_kk0u_005999698/std* "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer ID: 35317
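
The ppid check suggested above can be done by hand with ps, or with a small script along these lines. It is a sketch, not a tested diagnostic for this task: it reads the standard Linux /proc/<pid>/stat files, the helper names are hypothetical, and the "hadsm3mh_um" prefix is simply the process name reported earlier in the thread.

```python
import os


def parse_stat(stat_line):
    """Split one /proc/<pid>/stat line into (pid, command name, parent pid)."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split around the first "(" and the last ")".
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    # The fields after the name are: state, ppid, ...
    ppid = int(stat_line[right + 1:].split()[1])
    return pid, comm, ppid


def find_orphans(stat_lines, name_prefix="hadsm3mh_um", init_pid=1):
    """Return pids of matching processes whose parent is init (ppid 1)."""
    orphans = []
    for line in stat_lines:
        pid, comm, ppid = parse_stat(line)
        # The kernel truncates the name in stat to 15 characters,
        # so match on a prefix rather than the full file name.
        if comm.startswith(name_prefix) and ppid == init_pid:
            orphans.append(pid)
    return orphans


def read_proc_stats(proc_root="/proc"):
    """Yield the stat line of every live process (Linux only)."""
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            try:
                with open(os.path.join(proc_root, entry, "stat")) as handle:
                    yield handle.read()
            except OSError:
                continue  # the process exited while we were scanning
```

Calling find_orphans(read_proc_stats()) on the affected machine would list the worker processes that have lost their controller. A worker whose ppid is the pid of a live hadsm3mh_6.0 controller is not orphaned.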

Kenneth Larsen Joined: 26 Aug 04 Posts: 59 Credit: 438,133 RAC: 0	Message 35320 - Posted: 19 Oct 2008, 9:13:12 UTC Well, it seems to have solved itself for now; it has been running during the night and this morning only one process was still running. I'll keep monitoring it and report here if it happens again. Thanks for the help anyway, Thyme. ID: 35320

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 35321 - Posted: 19 Oct 2008, 10:07:42 UTC Look at the model's graphics every couple of days. If you can see all the colours for each display mode, e.g. red for hot areas and blue for cold, this will probably indicate healthy processing. But if the displays are monochrome, e.g. the entire globe is blue, something will be wrong. Cpdn news ID: 35321

Kenneth Larsen Joined: 26 Aug 04 Posts: 59 Credit: 438,133 RAC: 0	Message 35323 - Posted: 19 Oct 2008, 11:52:52 UTC Unfortunately this isn't possible for me: most of my machines are just a mainboard, CPU and memory, and I use ssh to control them. Even on the one that has a monitor, the "show graphics" button is greyed out even after running xhost +local: in the console. However, it seems to be running fine now. ID: 35323

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 35326 - Posted: 19 Oct 2008, 16:12:22 UTC Last modified: 19 Oct 2008, 16:15:56 UTC In that case, check the model's trickles and sec/timestep from time to time on its web page to make sure the processing isn't getting much faster or slower. The figures shown there are cumulative averages. And after the upload at the end of phase 2, check its temperature and precipitation graphs to make sure all the phase 2 data has been processed. If slab models become abnormal there are usually gaps in one graph or both. (Your phase 1 graphs are good.) At least one other member is further ahead with this model than you, so there probably isn't a defect in the model even though it's behaved badly on your computer. Cpdn news ID: 35326
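
Because the sec/timestep figures are cumulative averages, a recent slowdown is diluted by all the earlier timesteps. A rough way to recover the rate over just the latest stretch from two readings is sketched below. It assumes each figure is total seconds divided by total timesteps so far, which the thread does not spell out, and the function name is made up for illustration.

```python
def interval_rate(n_prev, avg_prev, n_now, avg_now):
    """Average sec/timestep over the timesteps between two readings.

    Each reading is (timesteps completed, cumulative average sec/timestep).
    """
    if n_now <= n_prev:
        raise ValueError("the later reading must cover more timesteps")
    total_prev = n_prev * avg_prev  # seconds spent up to the first reading
    total_now = n_now * avg_now     # seconds spent up to the second reading
    return (total_now - total_prev) / (n_now - n_prev)
```

For example, with made-up numbers, a cumulative average that moves from 2.0 at 1000 timesteps to 3.0 at 2000 timesteps means the second thousand timesteps averaged 4.0 sec/timestep, twice the earlier rate.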

Kenneth Larsen Joined: 26 Aug 04 Posts: 59 Credit: 438,133 RAC: 0	Message 35328 - Posted: 19 Oct 2008, 17:23:27 UTC It seems I was a bit too quick to say that the model was progressing normally again: now, after about 24 hours of normal crunching, it is starting to act up again. It spawned 4 more hadsm3mh_um_6.0 processes and I had to stop BOINC and kill the remaining processes. If I kill one, the rest disappear too. I'll let it continue, hoping we can learn something more from it. ID: 35328

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 35330 - Posted: 19 Oct 2008, 18:20:34 UTC Check your personal messages, Kenneth. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer ID: 35330