computation error at 100% complete

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62834 - Posted: 2 Nov 2020, 21:08:56 UTC
"So I will have to redo the changes on 2 other machines again... every time I suspend a job or restart boinc I seem to lose a w/u or 2. I now just pull the power cord out."
My experience is that stopping by pulling the power cord is more likely to cause models to crash. I don't think I have ever had a CPDN task crash when just suspending. My way of doing it is:
1. Suspend computing globally in BOINC.
2. Suspend each task individually.
Wait at least two minutes, then exit.
I then run top just to check the client really has exited, in case I have been careless and ticked the wrong thing in the manager's exit dialogue. (This may not be present in your version of BOINC.)
I seem to have been lucky with my new toy and haven't yet lost any tasks when stopping and starting. This, I think, lends credence to the theory that exiting while BOINC is still doing a disk write may cause some of the problems: the new toy has an NVMe SSD. The routine I follow is more or less as suggested by one of my fellow moderators back in the days when this was a much bigger problem. The lesson for me is: don't assume that anyone wanting to follow my instructions will know the bits I left out!
ID: 62834
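The last check in that routine (running top to confirm the client has really exited) could be scripted roughly as follows. This is only a sketch under assumptions: a Linux machine, and a client process named `boinc` or `boinc_client` (the name varies between installations). The function names are invented for illustration and are not part of BOINC.

```python
import subprocess

# Assumed process names for the BOINC client; adjust for your installation.
CLIENT_NAMES = {"boinc", "boinc_client"}


def client_still_running(ps_output: str, names=CLIENT_NAMES) -> bool:
    """Return True if any line of `ps -e -o comm=` output names the client."""
    return any(line.strip() in names for line in ps_output.splitlines())


def list_process_names() -> str:
    """Ask ps for the command name of every process, one per line."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    if client_still_running(list_process_names()):
        print("BOINC client is still running: wait before shutting down.")
    else:
        print("No BOINC client process found.")
```

Keeping the parsing separate from the `ps` call means the check can be tried on pasted text without a client running at all.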

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62835 - Posted: 2 Nov 2020, 21:10:37 UTC - in response to Message 62831.
I think that the thing about this is, if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties. Which doesn't matter so much. So, unless I have a following wind with my bored band lack of speed, I should suspend computation at that point.
ID: 62835

nairb Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62836 - Posted: 2 Nov 2020, 21:43:38 UTC
Well, well, well, it's a success. And it was still uploading the last 192 MB file when it reached 100%. So I will try the new method of stopping the processing of w/u. I had worked with computers for endless years and always hated using the on/off switch to solve issues. But power cuts never seem to kill a climate w/u. Just luck, I guess. Good idea to use top to check if the process really has cleared off. Let's hope the re-issued w/u work better when they arrive.
ID: 62836

nairb Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62837 - Posted: 2 Nov 2020, 23:25:57 UTC
So, I used the new method of suspending these w/u before stopping the client, then checking to make sure they had gone. It seems to have worked for 1 machine: all 4 w/u restarted and so far are still running. On the second machine the same procedure was applied and both w/u failed with a computation error when restarted. Maybe it's a Fedora 30 thing... That's some 50 days of processing lost/wasted today.
ID: 62837

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62838 - Posted: 2 Nov 2020, 23:32:33 UTC
I'm going with the "little Bo Peep" method:
Leave them alone
And they'll come home
Wagging their tails behind them.
ID: 62838

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62840 - Posted: 3 Nov 2020, 6:31:55 UTC
#882 are the resends of 877 with the file size limit increased.
ID: 62840

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981	Message 62848 - Posted: 4 Nov 2020, 19:44:16 UTC
OK, I have 4 WUs that have progressed between 20 and 75%, so some zips have uploaded. Should I change all instances of <max_nbytes>150000000.000000</max_nbytes> for each WU, or only those for the remaining zips? There are also other files that have <max_nbytes>0.000000</max_nbytes>, so I guess I should not alter these?
ID: 62848

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62849 - Posted: 4 Nov 2020, 20:07:33 UTC - in response to Message 62848.
If you have a fast internet connection, I wouldn't worry about it. The last zip will upload before the program gets to the problem part right at the end of "extra time".
ID: 62849

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62850 - Posted: 4 Nov 2020, 20:29:00 UTC
On the other hand, I have a slow internet connection, so in a text editor I did a search and replace for all instances of <max_nbytes>150000000.000000</max_nbytes> and doubled the value, replacing them with <max_nbytes>300000000.000000</max_nbytes>.
There is a risk with stopping and starting tasks that some might crash. (I haven't had any do so yet on this machine, but fingers crossed!) But if you don't exit the client before editing, it will write the old values back in, which I knew but forgot about the first time I did it. So the procedure is not totally without risk.
I am running five tasks concurrently at the moment. I have still to sit down and scientifically test how throughput varies with the number of tasks running, given their high demand for CPU cache memory.
So, as Les says, if your internet connection is fast enough, you are probably better off leaving them. I certainly think it is worth completing them, as some are likely to be hard failures given the number of machines without the 32-bit libs.
ID: 62850
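The search and replace described in this post can be sketched in Python. It works on text only and touches nothing but the exact value quoted in the thread. The thread does not say which file holds these entries; BOINC's client_state.xml in the data directory is assumed here, and the function name is made up for illustration. As the post warns, the client has to be fully exited first or it will write the old values back.

```python
import re

OLD_LIMIT = "150000000.000000"  # value quoted in the thread
NEW_LIMIT = "300000000.000000"  # doubled, as described in the thread


def raise_upload_limit(xml_text, old=OLD_LIMIT, new=NEW_LIMIT):
    """Replace every <max_nbytes>old</max_nbytes> entry with the new value.

    Entries holding any other value (for example 0.000000) are left alone.
    Returns a (new_text, number_of_replacements) tuple.
    """
    pattern = re.compile(r"<max_nbytes>" + re.escape(old) + r"</max_nbytes>")
    return pattern.subn(f"<max_nbytes>{new}</max_nbytes>", xml_text)


if __name__ == "__main__":
    sample = (
        "<max_nbytes>150000000.000000</max_nbytes>\n"
        "<max_nbytes>0.000000</max_nbytes>\n"
    )
    updated, count = raise_upload_limit(sample)
    print(f"{count} entry changed")
    print(updated)
```

To use it on a real file, one would read the file into a string, keep a backup copy, call `raise_upload_limit`, and write the result back while the client is stopped.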

Alan K Joined: 22 Feb 06 Posts: 491 Credit: 30,946,057 RAC: 13,930	Message 62851 - Posted: 4 Nov 2020, 23:48:59 UTC - in response to Message 62849.
"If you have a fast internet connection, I wouldn't worry about it. The last zip will upload before the program gets to the problem part right at the end of 'extra time'."
This could be why my two were OK, as I am on high-speed fibre (apart from the first 50 metres). I'll let the others I have run.
ID: 62851

bernard_ivo Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981	Message 62856 - Posted: 5 Nov 2020, 13:32:52 UTC - in response to Message 62850.
I changed the values following the process suggested and restarted the client. So far all 4 WUs are running OK and 1 zip has uploaded. I have one more task on another machine, but since I have relatively high-speed internet I will risk it with this one.
ID: 62856

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,691,690 RAC: 10,582	Message 62899 - Posted: 8 Nov 2020, 10:11:25 UTC
As a postscript, my two batch 878 tasks were safely delivered au naturel overnight, without any manual intervention. Here's the sequence of events, with timings, for others to base their decisions on.
08/11/2020 05:54:37 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip
08/11/2020 05:54:39 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip
08/11/2020 05:54:42 | climateprediction.net | Sending scheduler request: To send trickle-up message.
08/11/2020 05:54:59 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 05:56:47 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 06:16:44 | climateprediction.net | Computation for task hadam4h_c0yb_206511_5_878_012031093_0 finished
08/11/2020 06:16:46 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip
08/11/2020 06:16:49 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip
08/11/2020 06:55:24 | climateprediction.net | Sending scheduler request: To report completed tasks.
08/11/2020 06:55:26 | climateprediction.net | [sched_op] handle_scheduler_reply(): got ack for task hadam4h_c0yb_206511_5_878_012031093_0
First, the restart file, taking just two seconds. That's never going to be a problem.
Then, the big _5.zip that we were worried about (~200 MB). On my fast line, it took less than two minutes.
Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine.
If your balance (two-minute upload versus 20-minute finishing stretch) is substantially different, precautions might be wise. But otherwise, just relax.
ID: 62899
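For anyone wanting to work out the same balance from their own event log, here is a small sketch using the timestamps quoted in this post. It assumes the day/month/year format shown in the log; the helper name is invented for illustration.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M:%S"  # day/month/year, as in the quoted log


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Upload of the big _5.zip: "Started upload" to "Finished upload".
upload_s = seconds_between("08/11/2020 05:54:59", "08/11/2020 05:56:47")

# Margin: "Finished upload" of _5.zip to "Computation ... finished".
margin_s = seconds_between("08/11/2020 05:56:47", "08/11/2020 06:16:44")

print(f"_5.zip upload took {upload_s:.0f} s")
print(f"margin before the task finished: {margin_s:.0f} s")
print(f"upload time as a share of the margin: {upload_s / margin_s:.0%}")
```

If the upload time on your own line comes out close to, or larger than, the margin, that is the situation in which the post suggests precautions.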

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62900 - Posted: 8 Nov 2020, 11:34:30 UTC - in response to Message 62899.
Thanks for that, Richard. The real problem for me is if I have two or more tasks finishing at about the same time. I will just about get away with one, but more than that, no chance.
ID: 62900

Richard Haselgrove Joined: 1 Jan 07 Posts: 1061 Credit: 36,691,690 RAC: 10,582	Message 62901 - Posted: 8 Nov 2020, 14:04:11 UTC - in response to Message 62900.
"Thanks for that, Richard. The real problem for me is if I have two or more tasks finishing at about the same time. I will just about get away with one, but more than that, no chance."
Well, the question is beginning to become moot with the change to the templates: I've downloaded a replacement task since I posted, and confirmed that the files (in batch 887, this time) have a safe size limit. That limit is 210,000,000 bytes (200.27 MB in binary), which is still a bit close - I'd have preferred it if they'd gone up to 250 MB/MiB (either would do!).
Your problem might be helped if you staggered the task starts. I always have CPU tasks from other projects running alongside CPDN tasks: under those conditions, BOINC only ever allocates / downloads one task at a time. CPDN has set the server timeout so that a second request won't be made until a full hour has passed: if you can upload the problem files in less than an hour, that interval should be enough. [On cue, I can see my second replacement task - also 887 - downloading as I type]
During times of drought, you could let some lightweight project run as a resource-share-zero backup project. Then, when a new batch is released, each core should transition back to CPDN at hourly intervals. If the batch has uniform runtimes - they usually do - the finishes ought to follow that hourly interval as well. Small additional bonus: keeping the cores warm while you're waiting should reduce wear and tear from thermal cycling!
ID: 62901
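The size figures in this post mix decimal and binary units, so a quick conversion may help when comparing a limit with a file size. A minimal sketch follows; the function name is hypothetical, and the list of values simply reuses numbers mentioned in the thread.

```python
def describe_limit(n_bytes: int) -> str:
    """Show a byte count in decimal megabytes and binary mebibytes."""
    mb = n_bytes / 1_000_000        # 1 MB  = 10**6 bytes
    mib = n_bytes / (1024 * 1024)   # 1 MiB = 2**20 bytes
    return f"{n_bytes:,} bytes = {mb:.2f} MB = {mib:.2f} MiB"


if __name__ == "__main__":
    # Old limit, new limit, and the two readings of "250 MB/MiB".
    for limit in (150_000_000, 210_000_000, 250_000_000, 250 * 1024 * 1024):
        print(describe_limit(limit))
```

The 210,000,000-byte limit comes out at 200.27 MiB, matching the figure in the post, which is why a file of roughly 200 MB is described as still a bit close.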

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62924 - Posted: 11 Nov 2020, 10:47:58 UTC
"Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine."
Just uploaded a 4.zip from one of my tasks: 50 minutes. My other half was on a Zoom call during the upload, but that shouldn't have made that much difference. My old laptop would probably take so long over the bit after the zip is created that it would be fine, but on the Ryzen it wouldn't stand a chance.
ID: 62924