climateprediction.net (CPDN) home page
Thread 'Abnormally long-running models'

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57884 - Posted: 5 Mar 2018, 1:47:08 UTC
Last modified: 5 Mar 2018, 1:56:57 UTC

Hey,

This task below has been running on my Ryzen 7 machine for more than 8.5 days, and shows another 5.5 days remaining:

https://www.cpdn.org/cpdnboinc/result.php?resultid=21095973

Is this normal? Should I let it continue or should I abort it?

The current wah2 tasks on my machine run between 4 and 9 days, so 14 days running is quite unusual.

This NEXT task below had also been running for more than 8 days, and showed about 95% complete and "running," but it was actually stalled and was sitting there all day, accumulating no more run time. The "remaining" column showed "---". I aborted it. I didn't know what else to do.

https://www.cpdn.org/cpdnboinc/result.php?resultid=21095971

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 57885 - Posted: 5 Mar 2018, 2:10:24 UTC - in response to Message 57884.  

That's a 133-month task, whereas most of the tasks issued run for 3 to 24 months. It does take a long time.

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57888 - Posted: 5 Mar 2018, 4:55:54 UTC - in response to Message 57885.  

Thanks for the info!

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57889 - Posted: 5 Mar 2018, 10:04:38 UTC

wah2_global_a01k_199412_133_554_010979250_1

If you look at the task name, the first part (wah2) is the type of task, the second (133) the number of months, the third (554) the batch number, and the last (1) in this instance indicates it is on its second attempt. The batch number is also quite old, and as the first computer hasn't sent anything back since last April, the machine is probably no longer crunching.
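
For anyone who wants to pull those parts out of a task name automatically, here is a minimal sketch. The eight-field layout (model type, region, run ID, start date, months, batch, workunit number, attempt index) is an assumption inferred from this one example, not an official format description, and `parse_task_name` is a hypothetical helper.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    model: str     # e.g. "wah2"
    region: str    # e.g. "global" (assumed meaning)
    run_id: str    # e.g. "a01k" (assumed meaning)
    start: str     # e.g. "199412", assumed to be YYYYMM
    months: int    # length of the run in model months
    batch: int     # batch number
    workunit: str  # workunit number, kept as text to preserve leading zeros
    attempt: int   # 0 for the first attempt, 1 for the second, and so on


def parse_task_name(name: str) -> TaskName:
    """Split a task name on underscores, assuming exactly eight fields."""
    parts = name.split("_")
    if len(parts) != 8:
        raise ValueError(f"expected 8 underscore-separated fields, got {len(parts)}")
    model, region, run_id, start, months, batch, workunit, attempt = parts
    return TaskName(model, region, run_id, start,
                    int(months), int(batch), workunit, int(attempt))


task = parse_task_name("wah2_global_a01k_199412_133_554_010979250_1")
# Months, batch number, and which attempt this is (counting from 1).
print(task.months, task.batch, task.attempt + 1)
```

Names that do not split into eight fields raise an error rather than being guessed at, since other task types may use a different layout.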

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57890 - Posted: 5 Mar 2018, 13:34:29 UTC - in response to Message 57889.  

Thanks, Dave.

Do you know what went wrong with the second task that I listed? Did I do the right thing by aborting it?

Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,610,380
RAC: 3,377
Message 57891 - Posted: 5 Mar 2018, 17:36:35 UTC - in response to Message 57890.  
Last modified: 5 Mar 2018, 17:39:59 UTC

I don't think so. There are about 13 hours between two trickles. You can assume the task has hung only if the latest trickle is much older than that, you don't run any other projects on the same computer, and there aren't any upload problems.
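
As a rough illustration of that rule of thumb, here is a small sketch. The function name, the parameters and the factor of 3 used for "much older" are all assumptions made for the example, not anything defined by CPDN or BOINC.

```python
from datetime import datetime, timedelta


def looks_hung(last_trickle: datetime,
               now: datetime,
               typical_gap: timedelta,
               other_projects_running: bool,
               upload_problems: bool,
               multiples: float = 3.0) -> bool:
    """Return True only when all three conditions from the post hold.

    `multiples` is an assumed threshold for "much older than usual".
    """
    if other_projects_running or upload_problems:
        # A late trickle has an innocent explanation, so do not call it hung.
        return False
    return now - last_trickle > multiples * typical_gap


# Hypothetical timestamps, using the roughly 13-hour gap seen on this task.
now = datetime(2018, 3, 5, 12, 0)
gap = timedelta(hours=13)
print(looks_hung(now - timedelta(hours=20), now, gap, False, False))
print(looks_hung(now - timedelta(hours=48), now, gap, False, False))
```

With these made-up times, a trickle 20 hours ago is still within three gaps (39 hours), while one 48 hours ago is not.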

DadX
Joined: 30 Aug 06
Posts: 27
Credit: 1,879,577
RAC: 1,213
Message 57892 - Posted: 5 Mar 2018, 18:14:01 UTC

CPDN tasks have exhibited a bunch of strange behaviors over the years and there are a lot of tricks to goose the tasks when they appear to hang. It is usually best to give the task a couple of multiples of a trickle time to see if there was just a long tail (unlikely but possible). Other things I have tried with varying degrees of success are below. NOTE: MAKE SURE THE "Leave non-GPU tasks in memory while suspended" box is checked before you try these.

1. Suspend the task for a few minutes, then allow it to run. It might complete, it might lose a few percentage points and have to run to completion, or it might have no effect.

2. Cleanly exit the application, allowing it to shut down the tasks as it closes. If you have installed BOINC as a service, you'll probably have to reboot to get the services to restart; otherwise just restart the app (BOINC) and see what happens. Potential results are the same as above, with the added possibility that it will exit with a fatal error, which will leave you in the same place as an abort.

In both cases give it an hour or two, to see if it made a difference.
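
Option 1 can also be done from the command line with `boinccmd`, the command-line tool that ships with BOINC. A minimal sketch follows, assuming `boinccmd` is on the PATH and can reach the local client. The project URL below is an assumption, so check the "project URL" shown by `boinccmd --get_tasks` on your own machine. `task_command` and `run_task_command` are hypothetical helper names.

```python
import subprocess
from typing import List

# Assumption: replace with the project URL your own client reports.
PROJECT_URL = "http://climateprediction.net/"


def task_command(task_name: str, action: str,
                 project_url: str = PROJECT_URL) -> List[str]:
    """Build a `boinccmd --task URL task_name action` command line."""
    if action not in {"suspend", "resume", "abort"}:
        raise ValueError(f"unsupported action: {action}")
    return ["boinccmd", "--task", project_url, task_name, action]


def run_task_command(cmd: List[str]) -> None:
    """Execute the command; raises if boinccmd reports an error."""
    subprocess.run(cmd, check=True)


name = "wah2_global_a01k_199412_133_554_010979250_1"
# Only print the commands here; pass them to run_task_command() to execute.
print(" ".join(task_command(name, "suspend")))
print(" ".join(task_command(name, "resume")))
```

The example only prints the two commands so that nothing is suspended by accident. Wait a few minutes between the suspend and the resume, as described in step 1.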

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 57893 - Posted: 5 Mar 2018, 21:02:53 UTC - in response to Message 57890.  

[quote]Thanks, Dave.

Do you know what went wrong with the second task that I listed? Did I do the right thing by aborting it?[/quote]

There used to be relatively more tasks that would just kind of stop progressing. It seems like there are far fewer of these types of problems nowadays, but it still happens. The suggestions from DadX are worth trying if no progress has been made for quite a while. If those attempts don't do it, I would abort the task at that point.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 57894 - Posted: 5 Mar 2018, 23:14:11 UTC

DadX: I had to do a suspend/restart with this model recently to get it out of what looked like a loop at the end of its run. The stderr text shows one suspend, and a larger than usual difference between the "Run time" and "CPU time" amounts.

As geophi says, this doesn't happen often now but it does happen sometimes. I don't know whether it would have got out of its loop eventually, but I lost patience and the restart from a checkpoint did the trick ...

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57896 - Posted: 6 Mar 2018, 2:21:43 UTC
Last modified: 6 Mar 2018, 2:22:12 UTC

Thanks for the tips! I will try these if it happens again in the future with other models.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57898 - Posted: 6 Mar 2018, 7:19:41 UTC

[quote]Thanks for the tips! I will try these if it happens again in the future with other models.[/quote]

Though, looking at the model page, it has a suspend thread error. I have not seen a task complete with this. Previous batches (706, 707, 708) had a few failures with this error, some early on, others after several zips had uploaded. I couldn't see any pattern in the failures.

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57906 - Posted: 8 Mar 2018, 1:26:08 UTC - in response to Message 57898.  
Last modified: 8 Mar 2018, 1:29:46 UTC

Strange - during the entire time I have a batch of models running I am careful to not interrupt them in ANY way, including suspends. I learned THAT lesson a while back.

Since you're seeing evidence that a suspend occurred which caused a model failure, then it must have been when I was fiddling with this hung task, trying to get it going again (it was gaining no CPU time for most of the day, and showing NO time left until completion).

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 57908 - Posted: 8 Mar 2018, 8:28:09 UTC - in response to Message 57906.  

[quote]Since you're seeing evidence that a suspend occurred which caused a model failure, then it must have been when I was fiddling with this hung task, trying to get it going again (it was gaining no CPU time for most of the day, and showing NO time left until completion).[/quote]


I suspect that the suspend thread error is something to do with the task rather than your trying to get it going again.

MossyRock
Joined: 4 Oct 13
Posts: 27
Credit: 2,301,681
RAC: 7,632
Message 57910 - Posted: 8 Mar 2018, 14:39:51 UTC - in response to Message 57908.  

Thank you, Dave.