Message boards : Number crunching : Waiting to run (scheduler wait)
Author | Message |
---|---|
Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0 |
This I have not seen before. I have two tasks running, but the third one has been frozen for more than five hours now. The fourth died yesterday from a computer error... |
Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
I have something similar, which I also haven't seen before. Four tasks are running normally, and one looks as though it is running, but it is stuck at timestep 670536 and using no CPU time. The graphics display looks normal, not blue as it usually is when a model goes bad. Restarting the computer does not change things. hadcm3n_o2h3_1940 Update: see also this thread in the Unix section. Same problem. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
'Waiting to run' usually just means that the BOINC task is waiting for a core to become available while the other cores are busy with other tasks. Jord says on a different forum: "The 'scheduler wait' message in 6.13 is essentially the same as the 'waiting for memory' message in 6.12; a GPU does not have enough memory to continue the work." In that context he is talking about the GPU, but I presume the same applies to CPU tasks. Quote: "I have something similar, which I also haven't seen before. Four tasks running normally, and one that looks as though it is running, but it is stuck at timestep 670536 and using no CPU time. ..." The 25/50/75% stuff is something else. These points are where the model does some extra work: first validation, to make sure that nothing has gone out of realistic bounds, and second generating extra output files. Note that one of the things the project is trying to find out is which parameter sets are viable and which end up producing unrealistic models. So a crash at this point is not necessarily bad (depending on whether the error came from the original input parameters or from the PC). If something went wrong earlier, this is the point where the task is supposed to crash out. However, some people find that rather than crashing out, it gets stuck and needs to be aborted. The model also does not like to be interrupted at this point. On my PC, I have changed the BOINC settings so that it does not try to suspend the job when it sees CPU activity, and so that the job stays in memory rather than being swapped out to disk. If you see a lot of problems at these points (rather than just the occasional model), it is worth running a stability check, and also dialling down any overclocking. I'm a volunteer and my views are my own. |
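For anyone who wants to try the two settings mentioned in the last reply (keep tasks in memory, do not suspend on CPU activity), here is a minimal sketch of how they might look as a local preferences override. It assumes the BOINC client reads a `global_prefs_override.xml` file from its data directory and that `leave_apps_in_memory` and `suspend_cpu_usage` are the relevant preference tags; the poster did not say exactly which options were changed, so treat the mapping as a guess. The helper name `build_override` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory=True, suspend_cpu_usage=0.0):
    """Build the text of a BOINC preferences override (illustrative only).

    Assumed tag meanings:
      leave_apps_in_memory: 1 keeps suspended tasks in memory
      suspend_cpu_usage:    suspend when non-BOINC CPU usage exceeds this
                            percentage; 0 is assumed to mean "never suspend"
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.6f}"
    return ET.tostring(root, encoding="unicode")


# Print the fragment instead of writing it, since the data directory
# location differs between installations.
xml_text = build_override()
print(xml_text)
```

The same options can normally be set from the BOINC Manager's computing preferences dialog, which is probably the safer route; after editing the file by hand the client would need to be told to re-read local preferences.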
©2024 cpdn.org