Message boards : Number crunching : tasks restarted at 0% after sitting at 100%
Author | Message |
---|---|
Send message Joined: 27 Nov 14 Posts: 4 Credit: 130,886 RAC: 0 |
I had eight tasks running in parallel. After about 202 hours of elapsed time, for about 192 hours of compute time, five had reached 100% with estimated remaining shown as "...". After half an hour or so this situation had not changed, except that elapsed time and compute time both kept increasing. For other reasons, I then rebooted my machine, and all the tasks went back to 0% completed, but still with 202 hours of elapsed time. The estimated remaining was now 86000 hours (apparently a maximum value). When they had again run to the point where about 0.25% was completed, the estimated time to completion began to drop below 86000, as expected. Does that mean 1000 CPU-hours were just wasted on those five tasks? Should I expect the same on the other three when they reach 100%? Is there any point in letting the original five run again? This is my first foray into climate prediction, so I'm not inclined to waste more cycles until I understand what happened. I'm running a MacBook Pro with one Intel Core i7 processor, 4 cores hyper-threaded, with 16 GB of memory and 1 TB of SSD. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
James, based on my experience as a Mac user, I would say this is a Mac problem. As far as I can establish, all CPDN applications whose main version number is 7 will fail at the end on Macs. The Zip uploads, which contain the science data, will have gone through. However, because the task fails it will be reissued to another user (unless your task was the last one in its work unit). My workaround until the 7-series Mac problem is fixed is to alter the project preferences to exclude all the affected applications. That leaves the EU and ANZ models from the HADAM3P family and the 40-year HADCM3N model. (Note that the HADCM3N is a longer model and prone to crash at 10-year uploads.) Not great news, I'm sorry to say - but there are EU and ANZ models aplenty at the moment. Iain |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Should have waited till an experienced Mac user (Iain) had replied before starting my reply! The recommended procedure for avoiding problems when closing down or rebooting is to 1. suspend computation. 2. exit BOINC either by using Ctrl+Q or the File - Exit dialogue. Just clicking the x in the top right-hand corner doesn't actually exit, at least certainly not on my Linux box. You may well have done this and still be experiencing problems. It might be useful if you could look at your tasks page http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1347338&offset=0&show_names=0&state=0 and identify which tasks have been affected. It will then be possible to look at other tasks in the same work units to see if they have completed. If they show as completed on your tasks page, you can safely delete them from your projects/climateprediction.net folder. Abort the tasks from BOINC first. One other thing occurs to me: nothing has completed for CPDN since 14th December. Have you made any system changes since then which may account for that? I did look on the Mac section of the forum and didn't find any other instances of the problem you describe. |
Send message Joined: 27 Nov 14 Posts: 4 Credit: 130,886 RAC: 0 |
The tasks were for Africa 7.22, so that seems relevant. But the symptoms were different from what you describe. Although the tasks reached a reported level of 100%, results were never sent back, I think I got no credit, and each of the tasks resumed computing, but back at 0%. As far as I can tell, the final results just vanished, but the CPU and clock timers kept going from their old values as the computation started over from scratch. |
Send message Joined: 27 Nov 14 Posts: 4 Credit: 130,886 RAC: 0 |
Belay that. I think maybe I did get credit for those runs. How would I know? My account page shows 13.35 completed runs, but I don't know if those 5 are among them. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
You can tell if results, both zips and trickle_up files, are being/have been sent back, by looking at the Event Log. Transactions begin with your BOINC client making a scheduler request to the server, with the reason in the second part of the line. The Event Log will also have the name of the model. Credits are granted for each trickle_up file returned. The scripts run once per day, and normally the external stats sites will pick up the data within a day. However, as posted elsewhere, the front end server is down, after being attacked on Christmas Eve. So the stats sites will need to be getting data from our backup URL, and they may not be. So, time to introduce you to my word for situations such as this: Patience. |
Send message Joined: 27 Nov 14 Posts: 4 Credit: 130,886 RAC: 0 |
Ok. Thanks. |
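
For anyone checking the "1000 CPU-hours" figure in the opening post, here is a minimal Python sketch of the arithmetic. It uses only the approximate numbers quoted there (five tasks, about 192 hours of compute time and about 202 hours of elapsed time each). The variable and function names are made up for illustration. It says nothing about whether the work was actually lost, since the replies indicate the zip uploads and trickle_up files may already have been returned and credited.

```python
# Approximate figures quoted in the opening post; nothing here is measured.
TASKS_AT_100_PERCENT = 5
CPU_HOURS_PER_TASK = 192      # "about 192 hours of compute time"
ELAPSED_HOURS_PER_TASK = 202  # "about 202 hours of elapsed time"


def total_hours(tasks: int, hours_per_task: float) -> float:
    """Total hours across several tasks that each ran for the same time."""
    return tasks * hours_per_task


# CPU time spent on the five tasks that restarted at 0%.
cpu_hours_at_risk = total_hours(TASKS_AT_100_PERCENT, CPU_HOURS_PER_TASK)

# Fraction of wall-clock time each task actually spent computing.
cpu_share = CPU_HOURS_PER_TASK / ELAPSED_HOURS_PER_TASK

print(f"CPU-hours across the five tasks: {cpu_hours_at_risk}")
print(f"CPU time as a share of elapsed time: {cpu_share:.0%}")
```

On these figures the total comes to 960 CPU-hours, so "1000" is a fair round number.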
©2024 cpdn.org