Thread 'Something strange about cpu usage.'

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 45799 - Posted: 3 Apr 2013, 15:48:40 UTC Last modified: 3 Apr 2013, 15:52:12 UTC I have two full resolution ocean models running and a regional model in waiting. CPU usage, if I am doing nothing else, is only 54%! The full resolution models are 15649915 and 15548613. Normally, running 2 tasks on this machine (a dual core i3 processor), CPU usage stays at 100%. If I halt task 15649915 the regional model clicks in and CPU usage goes up to 100%. There is nothing in my settings to throttle CPU or memory usage, and if there was it would not do so differently with different tasks. Previously I have had 100% CPU activity with the same two tasks running. Any ideas about what might be happening? I have tried suspending all tasks, exiting BOINC and restarting it, then resuming the tasks, and nothing changes. Progress on task 15649915 seems to be stuck at 25.368%. Not surprising, as it seems to be using few if any CPU cycles. Looking at the models page, up until the last trickle it was going at the same rate as the other model. ID: 45799 · Reply Quote

wateroakley Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,337,510 RAC: 10,436	Message 45800 - Posted: 3 Apr 2013, 17:26:36 UTC - in response to Message 45799. Two thoughts about your symptoms. 1. A slow hard drive. In these circumstances the computer could be "I/O-bound" and the CPU will wait for disc I/O to complete. Probably more noticeable with the CM3/CM3 combo than the CM3/AM3 combo. A) Check if the disc activity light is "on" more than usual. B) Run the hard drive diagnostics. They may pass, but are unlikely to report a slow drive. C) Run a hard drive benchmark. I've seen two faulty hard drives exhibit your symptoms with CPDN, but appear "normal" for most applications. 2. CPU throttling or thermal shutdown. There are several ways for the CPU to be throttled, externally and internally, to avoid exceeding Tj or Tcase. A) Check the CPU temperatures, frequency and VID. B) If temperatures are high, check that the HSF is not clogged. The i5 CPU in my laptop was running at 97-98 deg C and being throttled by VID and multiplier to keep below Tj, all due to a clogged HSF. Good luck. ID: 45800 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45801 - Posted: 3 Apr 2013, 18:19:46 UTC - in response to Message 45799. Ah, 25.368% - the magic number. It's possible that the model has gone into a loop, so you need to do some meditation. Sit there for a while and stare at the data on the Show Graphics page. If it IS looping, the Time, Date, and possibly the Timestep, will all increment for a while, and then jump back to a previous value. Saying OMMMM now and then is optional. ID: 45801 · Reply Quote
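
For anyone who would rather jot down the numbers than meditate, the check described above can be sketched roughly as follows. This is only an illustration: it assumes you have written down the Timestep values by hand from the Show Graphics page at intervals, and the function names `find_loop_back` and `is_stuck` are made up for this sketch, not part of BOINC or the CPDN applications.

```python
def find_loop_back(timesteps):
    """Return (index, previous, current) for the first reading that is
    lower than the one before it, or None if the readings never go backwards.

    `timesteps` is a list of Timestep values noted by hand, in the order
    they were observed (an assumption of this sketch)."""
    for i in range(1, len(timesteps)):
        if timesteps[i] < timesteps[i - 1]:
            return i, timesteps[i - 1], timesteps[i]
    return None


def is_stuck(timesteps):
    """True when there are at least two readings and they are all identical,
    i.e. the model is not advancing at all rather than looping."""
    return len(timesteps) > 1 and len(set(timesteps)) == 1


if __name__ == "__main__":
    # Hypothetical readings: the value climbs, then jumps back (a loop).
    print(find_loop_back([1000, 1010, 1020, 1005, 1015]))
    # Hypothetical readings: the value never changes (stuck, not looping).
    print(is_stuck([265752, 265752, 265752]))
```

A backward jump suggests a loop, while identical readings suggest the model has simply stopped; either way the model is not making real progress.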

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 45802 - Posted: 3 Apr 2013, 18:20:05 UTC Last modified: 3 Apr 2013, 18:21:07 UTC The task being stuck on 25% is rather suspicious, given that it's a HADCM3N. If it is the only task running and it still makes no progress then it has probably gone into one of a number of HADCM3N decade failure modes and should be aborted. If, however, it makes normal progress (after a BOINC Manager start/stop) then the kind of performance trouble-shooting that hagar describes is in order. [Edit: Les quicker on the draw by 20 s.] ID: 45802 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 45803 - Posted: 3 Apr 2013, 18:38:47 UTC Last modified: 3 Apr 2013, 18:44:03 UTC Thanks for the ideas. I don't think it is CPU throttling - CPU temp is 67C under 100% load, and CPU load goes back up to 100% when I suspend this model and the regional one takes over. This does not happen when I suspend the other 3cn model. This also makes me think it is probably not the hard disk. Time to cross legs with feet on knees............. Seems stuck on timestep 265752. I will give it another hour or two, then abort if no change. ID: 45803 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 45805 - Posted: 3 Apr 2013, 21:30:43 UTC No progress - aborting. ID: 45805 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45811 - Posted: 5 Apr 2013, 3:59:19 UTC This is model 15649915. It hadn't trickled since 27 March so it must have been looping for a while. It would have eventually crashed if you hadn't aborted it. Dave, did you actually see it looping in the graphics globe? That's funny - the model's exit status is listed as 203 but the usual exit status for an aborted model is -197. Cpdn news ID: 45811 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 45813 - Posted: 5 Apr 2013, 7:07:25 UTC No, I didn't see any looping in graphics mode; the timestep seemed to have just stuck at 265752 and didn't change when I looked at it. The reason for no trickles since 27th March, however, is a more prosaic one. The computer was switched off from then till 3rd April as I was in Scotland for my parents' 60th wedding anniversary. I also looked to see what had happened to other tasks in the work unit but there weren't any! ID: 45813 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,704,964 RAC: 9,670	Message 45814 - Posted: 5 Apr 2013, 8:49:38 UTC - in response to Message 45811. That's funny - the model's exit status is listed as 203 but the usual exit status for an aborted model is -197. Nothing to worry about there. Just David A unilaterally redefining the error codes returned by the client, round about v7.0.25 - but would you be surprised to hear that he didn't think through the consequences for outcomes displayed on web pages? Nope, me neither. 203 is simply the new code for EXIT_ABORTED_VIA_GUI, so it matches Dave's explanation. The easiest place to see a list of the new codes is changeset [72368a6b]. It may be possible to update the CPDN web site to display them correctly by a simple drop-in replacement of 'result.inc' - that's worked at other projects still using older server code, like SIMAP - but CPDN's website may be too different from the current central BOINC code for that to work. Details are in the BOINC forum thread Status 'Cancelled by server' changed, but unfortunately all the references have been broken by the migration to a git repository. Let me know if you need help tracking anything down. ID: 45814 · Reply Quote
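
As a rough illustration of the point above, a results page could translate both the old and the new abort codes with a small lookup. This is only a sketch built from the two values mentioned in this thread (-197 and 203); the table `ABORT_CODES` and the function `describe_exit_status` are hypothetical names for this example, not the contents of 'result.inc' or any BOINC source file, and the full list of codes should be taken from the changeset mentioned above.

```python
# Only the two codes discussed in this thread; everything else is unknown here.
ABORT_CODES = {
    -197: "aborted by user (code reported by older clients)",
    203: "EXIT_ABORTED_VIA_GUI (code reported by newer clients, from about v7.0.25)",
}


def describe_exit_status(code):
    """Return a readable label for an exit status, falling back to a
    generic message for any code not covered by this small table."""
    return ABORT_CODES.get(code, f"unknown exit status {code}")


if __name__ == "__main__":
    for status in (203, -197, 1):
        print(status, "->", describe_exit_status(status))
```

The point is simply that both numbers describe the same event, so a page that only recognises -197 will show 203 as something unexpected.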

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 45815 - Posted: 5 Apr 2013, 9:52:27 UTC Thanks for the prompt clarification Richard, also I will bookmark the link to the exit codes as that might prove useful in the future. ID: 45815 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45816 - Posted: 5 Apr 2013, 11:41:34 UTC Thanks from me too for the clarification re error codes. I think I'll start a new thread just about this so I'll probably copy your post there, Richard, while leaving the original here. We'll need a reference thread searchable by title. But that's a job for after Monday as I have family coming to stay. Cpdn news ID: 45816 · Reply Quote

Chris Steketee Send message Joined: 19 Aug 08 Posts: 2 Credit: 218,320 RAC: 0	Message 45827 - Posted: 6 Apr 2013, 8:20:42 UTC I'm having what looks like the same problem: for the past few days, my (only) model is making no progress and using no CPU time. It's stuck on step 292,104 of 1,039,392. The process is called hadcm3n_6.07_i686-apple-darwin with arguments hadcm3n_o618_1980_40_008324497 ocean_o618_1980_40_008324497_0 atmos_o618_1980_40_008324497_0 spec3a_sw_3_asol2c_hadcm3 spec3a_lw_3_asol2c_hadcm3 waterfix.ancil.be.32 NAT_VOLC DMSSO2NH3_1900_RCP sulpc_oxidants_19_A2_1990f SPARC_O3_rebuild_1900 Stopping and starting BOINC does not fix it; there are no unusual messages: just Restarting task hadcm3n_o618_1980_40_008324497_0 using hadcm3n version 607. Do I understand that the way to deal with this is to abort the task? Does this mean the 28% of computation done so far is lost? Thanks, Chris ID: 45827 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45828 - Posted: 6 Apr 2013, 9:42:59 UTC Hi Chris, thanks for the details. Your model is here. On its web page it has recorded 10 trickles, which is one decade of the four. These models are rather fragile at the decadal points, i.e. at 25%, 50% and 75%. It had previously been sending up a trickle every couple of days but it seems to have been stuck at its present position for longer than that. I think there's no realistic hope of rescuing it and you'll have to abort it. This is a susceptibility of a small proportion of these HADCM models and doesn't imply anything wrong with your computer. But it will have uploaded its first decadal file and this will contain useful data, so your crunching time hasn't been in vain. Before you abort it, could you please open its graphics / globe window and tell us a) what the model date is, and b) whether it's looping between a few days, repeating them endlessly, and if this is the case, which date it loops back to. Cpdn news ID: 45828 · Reply Quote
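
To put numbers on the 25%, 50% and 75% points, here is a small worked sketch using the step counts Chris quoted (step 292,104 of 1,039,392). It assumes the four decades are equally long in timesteps, which is an assumption of this example rather than something confirmed in the thread; the function names are made up for illustration.

```python
TOTAL_STEPS = 1_039_392   # total timesteps quoted in Chris's post
CURRENT_STEP = 292_104    # step at which Chris's model is stuck
DECADES = 4               # "one decade of the four"


def progress_percent(step, total):
    """Progress through the model as a percentage of total timesteps."""
    return 100 * step / total


def decade_boundaries(total, decades):
    """Timesteps at the internal decadal points (25%, 50%, 75% for four
    decades), assuming every decade has the same number of timesteps."""
    return [total * k // decades for k in range(1, decades)]


if __name__ == "__main__":
    print(f"progress: {progress_percent(CURRENT_STEP, TOTAL_STEPS):.1f}%")
    print("decadal points:", decade_boundaries(TOTAL_STEPS, DECADES))
```

Under that assumption the first decadal point falls at timestep 259,848, so a model stuck at step 292,104 (about 28%) has already completed its first decade.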