climateprediction.net (CPDN) home page
Thread 'tasks restarted at 0% after sitting at 100%'

Message boards : Number crunching : tasks restarted at 0% after sitting at 100%

James McDonald

Joined: 27 Nov 14
Posts: 4
Credit: 130,886
RAC: 0
Message 51115 - Posted: 4 Jan 2015, 10:55:54 UTC

I had eight tasks running in parallel. After about 202 hours of elapsed time, for about 192 hours of compute time, five had reached 100% with estimated remaining shown as "..."

After half an hour or so this situation had not changed, except that elapsed time and compute time both kept increasing.

For other reasons, I then rebooted my machine, and all the tasks went back to 0% completed, but still with 202 hours of elapsed time. The estimated remaining was now 86000 hours (apparently a maximum value). When they had again run to the point where about 0.25% was completed, the estimated time to completion began to drop below 86000, as expected.

Does that mean roughly 1000 CPU-hours were just wasted on those five tasks? Should I expect the same on the other three when they reach 100%? Is there any point in letting the original five run again?

This is my first foray into climate prediction, so I'm not inclined to waste more cycles until I understand what happened.

I'm running a MacBook Pro with one Intel Core i7 processor, 4 cores hyper-threaded, with 16 GB of memory and 1 TB of SSD.
ID: 51115
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 51116 - Posted: 4 Jan 2015, 11:24:40 UTC - in response to Message 51115.  

James,

Based on my experience, as a Mac user, I would say this is a Mac problem. As far as I can establish, all CPDN applications whose main version number is 7 will fail at the end on Macs. The Zip uploads will have gone through, which contain the science data. However, because the task fails it will be reissued to another user (unless your task was the last one in its work unit).

My workaround until the 7-series Mac problem is fixed is to alter the project preferences to exclude all the affected applications. That leaves the EU and ANZ models from the HADAM3P family and the 40-year HADCM3N model. (Note that the HADCM3N is a longer model and prone to crash at 10-year uploads.)

Not great news, I'm sorry to say - but there are EU and ANZ models aplenty at the moment.

Iain
ID: 51116
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 51117 - Posted: 4 Jan 2015, 11:25:05 UTC - in response to Message 51115.  
Last modified: 4 Jan 2015, 12:10:00 UTC

I should have waited until an experienced Mac user (Iain) had replied before starting my reply!

The recommended procedure for avoiding problems when closing down or rebooting is to
1. suspend computation.
2. exit BOINC, either with Ctrl+Q or via File > Exit. Just clicking the x in the top right-hand corner doesn't actually exit, at least not on my Linux box.
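
For anyone who prefers the command line, the same two steps can be sketched with the boinccmd tool that ships with BOINC. This is only a minimal sketch: it assumes boinccmd is on your PATH and is allowed to talk to the running client (for example, because it is run from the BOINC data directory). The function names are made up for illustration.

```python
import subprocess


def shutdown_commands(boinccmd="boinccmd"):
    """Return the boinccmd invocations for steps 1 and 2 above.

    The boinccmd location is an assumption; pass a full path if needed.
    """
    return [
        [boinccmd, "--set_run_mode", "never"],  # step 1: suspend computation
        [boinccmd, "--quit"],                   # step 2: ask the client to exit
    ]


def run_shutdown(boinccmd="boinccmd", dry_run=True):
    """Print the commands (dry run) or actually execute them in order."""
    for cmd in shutdown_commands(boinccmd):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling run_shutdown() only prints the commands; run_shutdown(dry_run=False) executes them. Note that a run mode set without a duration stays in force, so after restarting you would need `boinccmd --set_run_mode auto` (or the Activity menu) to resume computing.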

You may well have done this and still be experiencing problems. It might be useful if you could look at your tasks page, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1347338&offset=0&show_names=0&state=0, and identify which tasks have been affected. It will then be possible to look at other tasks in the same work units to see if they have completed.

If your tasks page shows them as completed, you can abort the tasks in BOINC and then safely delete them from your projects/climateprediction.net folder.

One other thing occurs to me: nothing has completed for CPDN on your machine since 14th December. Have you made any system changes since then which may account for that?

I did look on the Mac section of the forum and didn't find any other instances of the problem you describe.
ID: 51117
James McDonald

Joined: 27 Nov 14
Posts: 4
Credit: 130,886
RAC: 0
Message 51122 - Posted: 5 Jan 2015, 0:54:36 UTC - in response to Message 51116.  

The tasks were for Africa 7.22, so that seems relevant.

But the symptoms were different from what you describe. Although the tasks reached
a reported level of 100%, results were never sent back, I think I got no credit, and each
of the tasks resumed computing, but back at 0%.

As far as I can tell, the final results just vanished, but the CPU and clock timers kept
going from their old values as the computation started over from scratch.

ID: 51122
James McDonald

Joined: 27 Nov 14
Posts: 4
Credit: 130,886
RAC: 0
Message 51123 - Posted: 5 Jan 2015, 0:57:10 UTC - in response to Message 51116.  

Belay that. I think maybe I did get credit for those runs.

How would I know? My account page shows 13.35 completed runs,
but I don't know if those 5 are among them.
ID: 51123
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51124 - Posted: 5 Jan 2015, 3:17:49 UTC

You can tell whether results (both zips and trickle_up files) are being or have been sent back by looking at the Event Log.
Each transaction begins with your BOINC client making a scheduler request to the server, with the reason given in the second part of the line.

The Event Log will also have the name of the model.
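
If the Event Log is long, a small filter can pull out just the server traffic. The following is a rough sketch, not an official tool: the message phrases it searches for ("Sending scheduler request:", "Started upload of", "Finished upload of") are assumptions about typical BOINC wording and may differ between client versions, and the sample lines and file name are invented for illustration.

```python
import re

# Assumed phrases for server traffic in a BOINC Event Log; adjust to match
# what your client version actually prints.
PATTERNS = [
    ("scheduler_request", re.compile(r"Sending scheduler request: (.+)")),
    ("upload_started", re.compile(r"Started upload of (\S+)")),
    ("upload_finished", re.compile(r"Finished upload of (\S+)")),
]


def summarize(lines):
    """Return (kind, detail) pairs for lines that look like server traffic."""
    events = []
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((kind, match.group(1)))
                break
    return events


# Hypothetical log lines, made up for this example.
SAMPLE = [
    "05-Jan-2015 03:00:00 [climateprediction.net] Sending scheduler request: To send trickle-up message.",
    "05-Jan-2015 03:05:00 [climateprediction.net] Started upload of example_task_1.zip",
    "05-Jan-2015 03:09:00 [climateprediction.net] Finished upload of example_task_1.zip",
    "05-Jan-2015 03:10:00 [climateprediction.net] Some unrelated message",
]

for kind, detail in summarize(SAMPLE):
    print(kind, "->", detail)
```

To use it on a real log, copy the Event Log text into a file and pass its lines to summarize(). A zip with a "started" entry but no matching "finished" entry would suggest an upload that has not gone through yet.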

Credits are granted for each trickle_up file returned.
The scripts run once per day, and normally the external stats sites will pick up the data within a day.
However, as posted elsewhere, the front-end server is down after being attacked on Christmas Eve. So the stats sites will need to get their data from our backup URL, and they may not be doing so.

So, time to introduce you to my word for situations such as this: Patience.

ID: 51124
James McDonald

Joined: 27 Nov 14
Posts: 4
Credit: 130,886
RAC: 0
Message 51125 - Posted: 5 Jan 2015, 4:05:11 UTC - in response to Message 51124.  

Ok. Thanks.

ID: 51125

©2024 cpdn.org