climateprediction.net (CPDN) home page
Thread 'Waiting to run (scheduler wait)'

Message boards : Number crunching : Waiting to run (scheduler wait)

Bellator
Joined: 31 Mar 05
Posts: 44
Credit: 234,235
RAC: 0
Message 46873 - Posted: 26 Aug 2013, 5:23:59 UTC

This I have not seen before. I have two tasks running, but the third one has been frozen for more than five hours now. The fourth died yesterday from a computer error...

Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 46875 - Posted: 26 Aug 2013, 5:44:11 UTC - in response to Message 46873.  
Last modified: 26 Aug 2013, 5:47:38 UTC

I have something similar, which I also haven't seen before. Four tasks are running normally, and one looks as though it is running but is stuck at timestep 670536 and using no CPU time. The graphics display looks normal, not blue as it usually is when a model goes bad. Restarting the computer does not change things.

hadcm3n_o2h3_1940

Update: see also this thread in the Unix section. Same problem.
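
For anyone wanting to check for this symptom systematically, here is a minimal Python sketch of the test described above: a task counts as stalled if neither its CPU time nor its timestep has advanced over a long enough window. The `Sample` fields, the function name, and the one-hour default are all hypothetical; the readings would have to be noted down by hand (for example from the task properties), since nothing here talks to BOINC itself.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a task (hypothetical field names)."""
    wall_seconds: float  # wall-clock time when the reading was taken
    cpu_seconds: float   # cumulative CPU time shown for the task
    timestep: int        # model timestep shown at that moment


def looks_stalled(samples, min_wall_seconds=3600.0):
    """Return True if CPU time and timestep are both unchanged over the window.

    Compares only the first and last samples, and reports False when the
    window is shorter than min_wall_seconds (an arbitrary, assumed threshold).
    """
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    if last.wall_seconds - first.wall_seconds < min_wall_seconds:
        return False
    return (last.cpu_seconds == first.cpu_seconds
            and last.timestep == first.timestep)
```

A task that is merely waiting for a free core would also show no CPU progress, so this only narrows things down; it does not say why the task stopped.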

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46877 - Posted: 26 Aug 2013, 7:31:34 UTC
Last modified: 26 Aug 2013, 7:48:22 UTC



'Waiting to run' usually just means that the BOINC task is waiting for a free core while the other cores are busy with other tasks.

Jord says on a different forum:
The "scheduler wait" message in 6.13 is essentially the same as the "waiting for memory" message in 6.12; a GPU does not have enough memory to continue the work.

But in 6.12 it would also come up when other causes left no memory to be used. Essentially, what this message now means is that the application was temporarily exited by BOINC and is waiting to be rescheduled, to run again at a later time when we hope enough memory is available.
...

In that context he is talking about the GPU, but I presume the same applies to CPU tasks.

Greg van Paassen wrote:
I have something similar, which I also haven't seen before. Four tasks running normally, and one that looks as though it is running, but it is stuck at timestep 670536 and using no CPU time. ...
Update: see also this thread in the Unix section. Same problem.

The 25/50/75% stuff is something else. These are the points where the model does some extra work: first validation, to make sure that nothing has gone out of realistic bounds, and second generating extra output files. Note that one of the things the project is trying to find out is which parameter sets are viable and which end up producing unrealistic models. So a crash at one of these points is not necessarily bad (depending on whether the error came from the original input parameters or from the PC).

So if something went wrong earlier, this is the point where the task is supposed to crash out. However, some people find that rather than crashing out, it gets stuck and needs to be aborted.

Also, the model does not like to be interrupted at these points. On my PC I have changed the BOINC settings so that it does not try to suspend the job when it sees CPU activity, and so that a suspended task stays in memory rather than being written out to disk.
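
As a rough illustration of those settings, the Python sketch below builds the text of a BOINC preferences override file. The element names (`run_if_user_active`, `suspend_cpu_usage`, `leave_apps_in_memory`) are my recollection of the `global_prefs_override.xml` format and should be checked against the BOINC documentation for your client version; the values are assumptions about what "do not suspend" and "stay in memory" translate to, not the exact settings used above. The same options can normally be set through the BOINC Manager's computing preferences instead.

```python
import xml.etree.ElementTree as ET


def build_override(settings):
    """Return override XML text with one child element per setting."""
    root = ET.Element("global_preferences")
    for name, value in settings.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed element names and values; verify before use.
SETTINGS = {
    "run_if_user_active": 1,    # keep computing while the PC is in use
    "suspend_cpu_usage": 0,     # 0 = never suspend because of other CPU load
    "leave_apps_in_memory": 1,  # keep suspended tasks in memory
}

xml_text = build_override(SETTINGS)
print(xml_text)
```

The sketch only prints the text. Where the file lives depends on the installation, so it deliberately does not write anything to disk.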

If you see a lot of problems at these points (rather than just the occasional failed model), it is worth running a stability check, and also dialling back any overclocking.
I'm a volunteer and my views are my own.

