Abnormally long-running models
Author | Message |
---|---|
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Hey, this task has been running on my Ryzen 7 machine for more than 8.5 days and shows another 5.5 days remaining: https://www.cpdn.org/cpdnboinc/result.php?resultid=21095973 Is this normal? Should I let it continue or abort it? The current wah2 tasks on my machine run between 4 and 9 days, so 14 days is quite unusual. This next task had also been running for more than 8 days and showed about 95% complete and "running," but it was actually stalled: it sat there all day accumulating no more run time, and the "remaining" column showed "---". I aborted it because I didn't know what else to do. https://www.cpdn.org/cpdnboinc/result.php?resultid=21095971 |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
That's a 133-month task, whereas most of the tasks issued run for 3 to 24 months. It does take a long time. |
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Thanks for the info! |
Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946 |
wah2_global_a01k_199412_133_554_010979250_1 If you look at the task name, the first part (wah2) is the type of task, 133 is the number of months, 554 is the batch number, and the final 1 indicates that, in this instance, it is on its second attempt. The batch number is also quite old, and as the first computer hasn't sent anything back since last April, that machine is probably no longer crunching. |
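For anyone who wants to read these fields off a task name automatically, here is a minimal Python sketch. It assumes the name always has eight underscore-separated parts laid out like the example above. Only the task type, month count, batch number and attempt number are described in the post; the other field names (`region`, `run_id`, `start`, `workunit`) are guesses used for illustration, and `parse_task_name` is a hypothetical helper, not part of any CPDN or BOINC tool.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    model: str      # type of task, e.g. "wah2"
    region: str     # assumption: second field is a region label
    run_id: str     # assumption: third field is a run identifier
    start: str      # assumption: fourth field is a start date (YYYYMM)
    months: int     # number of model months in the task
    batch: int      # batch number
    workunit: str   # assumption: seventh field is a workunit id
    attempt: int    # 0 = first attempt, 1 = second attempt, ...


def parse_task_name(name: str) -> TaskName:
    """Split a task name into its underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {name!r}")
    model, region, run_id, start, months, batch, workunit, attempt = parts
    return TaskName(model, region, run_id, start,
                    int(months), int(batch), workunit, int(attempt))


if __name__ == "__main__":
    task = parse_task_name("wah2_global_a01k_199412_133_554_010979250_1")
    print(f"{task.model}: {task.months} months, batch {task.batch}, "
          f"attempt number {task.attempt + 1}")
```

The month count is the useful part for estimating run time: a 133-month task should be expected to run several times longer than the usual 3 to 24 month ones.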
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Thanks, Dave. Do you know what went wrong with the second task that I listed? Did I do the right thing by aborting it? |
Joined: 3 Sep 04 Posts: 126 Credit: 26,610,380 RAC: 3,377 |
I don't think so. There are about 13 hours between two trickles. You can assume the task has hung only if the latest trickle is much older than that, you aren't running any other projects on the same computer, and there aren't any upload problems. |
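The three conditions above can be written down as a small check. This is only a rough sketch: the function name `looks_hung` and the threshold of three trickle intervals are assumptions chosen for illustration (the post only says "much longer"), and the trickle times would have to be read by hand from the task's result page.

```python
from datetime import datetime, timedelta


def looks_hung(last_trickle: datetime,
               now: datetime,
               typical_gap: timedelta,
               multiple: float = 3.0,
               other_projects_running: bool = False,
               uploads_stuck: bool = False) -> bool:
    """Return True only when a missing trickle has no innocent explanation.

    `multiple` is a hypothetical threshold: how many typical trickle
    intervals must pass before the silence counts as suspicious.
    """
    # Other projects sharing the CPU, or stuck uploads, can both delay
    # trickles without the task itself being stalled.
    if other_projects_running or uploads_stuck:
        return False
    return (now - last_trickle) > typical_gap * multiple


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0)
    gap = timedelta(hours=13)  # roughly the interval seen on this task
    print(looks_hung(now - timedelta(hours=20), now, gap))
    print(looks_hung(now - timedelta(hours=50), now, gap))
```

With a 13-hour interval and the assumed multiple of 3, a task is flagged only after about 39 hours without a trickle.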
Joined: 30 Aug 06 Posts: 27 Credit: 1,879,577 RAC: 1,213 |
CPDN tasks have exhibited a bunch of strange behaviors over the years, and there are a lot of tricks to goose the tasks when they appear to hang. It is usually best to give the task a couple of multiples of a trickle time to see if there was just a long tail (unlikely but possible). Other things I have tried, with varying degrees of success, are below. NOTE: MAKE SURE THE "Leave non-GPU tasks in memory while suspended" box is checked before you try these. 1. Suspend the task for a few minutes, then allow it to run. It might complete, it might lose a few percentage points and have to run to completion again, or it might have no effect. 2. Cleanly exit the application, allowing it to shut down the tasks as it closes. If you have installed BOINC as a service you'll probably have to reboot to get the services to restart; otherwise just restart the app (BOINC) and see what happens. Potential results are the same as above, with the added possibility that the task will exit with a fatal error, which will leave you in the same place as an abort. In both cases give it an hour or two to see if it made a difference. |
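Option 1 can also be done from the command line instead of the BOINC Manager. The sketch below is a rough illustration that wraps `boinccmd --task <project_url> <task_name> suspend|resume`. The project URL is an assumption and must match whatever your own client lists for CPDN, the five-minute pause is an arbitrary reading of "a few minutes", and `task_command` and `nudge_task` are hypothetical helper names. Depending on your installation, `boinccmd` may also need to be run from the BOINC data directory or given a password. Check the leave-in-memory setting mentioned above before trying it.

```python
import subprocess
import time

# Assumption: use the project URL exactly as your BOINC client shows it.
PROJECT_URL = "https://www.cpdn.org/cpdnboinc/"


def task_command(task_name: str, op: str,
                 project_url: str = PROJECT_URL) -> list[str]:
    """Build the boinccmd argument list for one task operation."""
    if op not in {"suspend", "resume"}:
        raise ValueError(f"unsupported operation: {op!r}")
    return ["boinccmd", "--task", project_url, task_name, op]


def nudge_task(task_name: str, pause_seconds: float = 300,
               run=subprocess.run, sleep=time.sleep) -> None:
    """Suspend a task, wait a few minutes, then let it run again."""
    run(task_command(task_name, "suspend"), check=True)
    sleep(pause_seconds)
    run(task_command(task_name, "resume"), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for op in ("suspend", "resume"):
        print(" ".join(task_command(
            "wah2_global_a01k_199412_133_554_010979250_1", op)))
```

The `run` and `sleep` parameters exist only so the sequence can be rehearsed without touching a live task. As with the manual approach, give the task an hour or two afterwards before judging the result.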
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Thanks, Dave. There used to be relatively more tasks that would just kind of stop progressing. There seem to be far fewer of these problems nowadays, but they still happen. The suggestions from DadX are worth trying if no progress has been made for quite a while. If those attempts don't work, I would abort the task at that point. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
DadX: I had to do a suspend/restart with this model recently to get it out of what looked like a loop at the end of its run. The stderr text shows one suspend, and a larger than usual difference between the "Run time" and "CPU time" amounts. As geophi says, this doesn't happen often now but it does happen sometimes. I don't know whether it would have got out of its loop eventually, but I lost patience and the restart from a checkpoint did the trick ... |
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Thanks for the tips! I will try these if it happens again in the future with other models. |
Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946 |
"Thanks for the tips! I will try these if it happens again in the future with other models." Though looking at the model page, it has a suspend thread error. I have not seen a task complete with this. Previous batches (706, 707, 708) had a few failures with this error, some early on and others after several zips had uploaded. I couldn't see any pattern in the failures. |
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Strange - during the entire time I have a batch of models running I am careful to not interrupt them in ANY way, including suspends. I learned THAT lesson a while back. Since you're seeing evidence that a suspend occurred which caused a model failure, then it must have been when I was fiddling with this hung task, trying to get it going again (it was gaining no CPU time for most of the day, and showing NO time left until completion). |
Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946 |
"Since you're seeing evidence that a suspend occurred which caused a model failure, then it must have been when I was fiddling with this hung task, trying to get it going again (it was gaining no CPU time for most of the day, and showing NO time left until completion)." I suspect that the suspend thread error is something to do with the task rather than your trying to get it going again. |
Joined: 4 Oct 13 Posts: 27 Credit: 2,301,681 RAC: 7,632 |
Thank you, Dave. |
©2024 cpdn.org