Questions and Answers : Unix/Linux : computation error at 100% complete
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
So I will have to redo the changes on 2 other machines again................ every time I suspend a job or restart boinc I seem to lose a w/u or 2. My experience is that stopping by pulling the power cord is more likely to cause models to crash. I don't think I have ever had a CPDN task crash when just suspending. My way of doing it is: 1. Suspend computing globally in BOINC. 2. Suspend each task individually. 3. Wait at least two minutes, then exit. I then run top just to check the client really has exited, in case I have been careless and ticked the wrong thing in the manager's exit dialogue. (This may not be present in your version of BOINC.) I seem to have been lucky with my new toy and haven't yet lost any tasks when stopping and starting. This, I think, lends credence to the theory that exiting while BOINC is still doing a disk write may cause some of the problems - the new toy has an NVMe SSD. The routine I follow is more or less as suggested by one of my fellow moderators back in the days when this was a much bigger problem. Lesson for me is - don't assume that anyone wanting to follow my instructions will know the bits I left out! |
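The "run top to check the client really has exited" step can also be scripted. What follows is only a minimal sketch, assuming a Linux machine with a `/proc` filesystem and assuming the client process is named `boinc` (some distributions may use a different name); the function name `is_process_running` is made up for illustration and is not part of BOINC.

```python
import os

def is_process_running(name, proc_root="/proc"):
    """Return True if any process listed under proc_root has this command name."""
    for entry in os.listdir(proc_root):
        # Only numeric directory names are process IDs.
        if not entry.isdigit():
            continue
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path) as fh:
                if fh.read().strip() == name:
                    return True
        except OSError:
            # The process exited while we were looking, or we may not read it.
            continue
    return False
```

Calling `is_process_running("boinc")` after exiting should return `False` before any files are edited or the machine is powered down; if it returns `True`, the client is still alive and may still be writing to disk.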
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
I think that the thing about this is, if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties. So, unless I have a following wind with my bored band lack of speed, I should suspend computation at that point. |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
Well, well, well, it's a success. And it was still uploading the last 192 MB file when it reached 100%. So I will try the new method of stopping the processing of w/u. I had worked with computers for endless years and always hated using the on/off switch to solve issues. But power cuts never seem to kill a climate w/u. Just luck, I guess. Good idea to use top to check if the process really has cleared off. Let's hope the re-issued w/u work better when they arrive. |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
So, I used the new method of suspending these w/u before stopping the client, then checking to make sure they had gone. It seems to have worked for 1 machine: all 4 w/u restarted and so far are still running. On the second machine the same procedure was applied and both w/u failed with a computation error when restarted. Maybe it's a Fedora 30 thing...... That's some 50 days of processing lost/wasted today. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'm going with the "little Bo Peep" method: leave them alone, and they'll come home, wagging their tails behind them. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
#882 are the resends of 877 with the file size limit increased. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Ok, I have 4 WUs that have progressed to between 20 and 75%, so some zips have uploaded. Should I change all instances of <max_nbytes>150000000.000000</max_nbytes> for each WU, or only those for the remaining zips? There are also other files that have <max_nbytes>0.000000</max_nbytes>, so I guess I should not alter these? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If you have a fast internet connection, I wouldn't worry about it. The last zip will upload before the program gets to the problem part right at the end of "extra time". |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
On the other hand, I have a slow internet connection, so in a text editor I did a search and replace on all instances of <max_nbytes>150000000.000000</max_nbytes>, doubling the value by replacing them with <max_nbytes>300000000.000000</max_nbytes>. There is a risk with stopping and starting tasks that some might crash. (I haven't had any do so yet on this machine, but fingers crossed!) But if you don't exit the client before editing, it will write the old values back in, which I knew but forgot about the first time I did it. So the procedure is not totally without risk. I am running five tasks concurrently at the moment. I have still to sit down and scientifically test how throughput varies with the number of tasks running, given their high demand for CPU cache memory. So, as Les says, if your internet connection is fast enough, you are probably better off leaving them. I certainly think it is worth completing them, as some are likely to be hard failures given the number of machines without the 32-bit libs. |
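The search and replace described above can be sketched in a few lines. This is an illustration rather than a tested recipe: the thread does not name the file being edited (BOINC's `client_state.xml` in the data directory is assumed here), and the helper name `raise_upload_limit` is invented for the example. It only touches the exact 150000000.000000 entries, so files with `<max_nbytes>0.000000</max_nbytes>` are left alone.

```python
OLD_LIMIT = "<max_nbytes>150000000.000000</max_nbytes>"
NEW_LIMIT = "<max_nbytes>300000000.000000</max_nbytes>"

def raise_upload_limit(xml_text, old=OLD_LIMIT, new=NEW_LIMIT):
    """Replace every exact occurrence of the old limit; return (new_text, count)."""
    count = xml_text.count(old)
    return xml_text.replace(old, new), count
```

To use it, read the file's contents only after the client has fully exited, pass the text through `raise_upload_limit`, keep a backup copy of the original, and write the result back before restarting the client. A returned count of zero means nothing matched and the file should be left untouched.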
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,996,606 RAC: 14,349 |
If you have a fast internet connection, I wouldn't worry about it. This could be why my two were OK, as I am on high-speed fibre (apart from the first 50 metres). I'll let the others I have run. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I changed the values following the process suggested and restarted the client. So far all 4 WUs are running OK and 1 zip has uploaded. I have one more task on another machine, but since I have relatively high-speed internet I will risk it with this one. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
As a postscript, my two batch 878 tasks were safely delivered au naturel overnight, without any manual intervention. Here's the sequence of events, with timings, for others to base their decisions on. 08/11/2020 05:54:37 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip 08/11/2020 05:54:39 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip 08/11/2020 05:54:42 | climateprediction.net | Sending scheduler request: To send trickle-up message. 08/11/2020 05:54:59 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip 08/11/2020 05:56:47 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip 08/11/2020 06:16:44 | climateprediction.net | Computation for task hadam4h_c0yb_206511_5_878_012031093_0 finished 08/11/2020 06:16:46 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip 08/11/2020 06:16:49 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip 08/11/2020 06:55:24 | climateprediction.net | Sending scheduler request: To report completed tasks. 08/11/2020 06:55:26 | climateprediction.net | [sched_op] handle_scheduler_reply(): got ack for task hadam4h_c0yb_206511_5_878_012031093_0 First, the restart file, taking just two seconds. That's never going to be a problem. Then, the big _5.zip that we were worried about (~200 MB). On my fast line, it took less than two minutes. Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine. If your balance (two minute upload versus 20 minute finishing stretch) is substantially different, precautions might be wise. But otherwise, just relax. |
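For anyone wanting to work out the same balance from their own event log, here is a rough sketch of the arithmetic using three of the lines quoted above. The helper names are made up for the example, and the day/month order of the timestamps is an assumption (it makes no difference to the durations, since all the events fall on the same day).

```python
from datetime import datetime

LOG = """\
08/11/2020 05:54:59 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 05:56:47 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 06:16:44 | climateprediction.net | Computation for task hadam4h_c0yb_206511_5_878_012031093_0 finished
"""

def parse_time(line):
    """Parse the timestamp in the first ' | '-separated field of a log line."""
    stamp = line.split(" | ")[0]
    # Assumption: day/month/year order.
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")

def event_time(lines, *fragments):
    """Return the time of the first line that contains every given fragment."""
    for line in lines:
        if all(fragment in line for fragment in fragments):
            return parse_time(line)
    raise ValueError(f"no log line contains all of {fragments}")

lines = LOG.splitlines()
upload_start = event_time(lines, "Started upload of", "_5.zip")
upload_end = event_time(lines, "Finished upload of", "_5.zip")
task_end = event_time(lines, "Computation for task", "finished")

upload_seconds = int((upload_end - upload_start).total_seconds())
grace_seconds = int((task_end - upload_end).total_seconds())
print(f"upload took {upload_seconds} s, grace before finish {grace_seconds} s")
```

With these lines the upload comes out at 108 seconds and the remaining grace at 1197 seconds (just under 20 minutes), matching the "less than two minutes" and "nearly 20 minutes" figures in the post.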
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Thanks for that Richard. The real problem for me is if I have two or more tasks finishing about the same time. I will just about get away with one but more than that, no chance. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Thanks for that Richard. The real problem for me is if I have two or more tasks finishing about the same time. I will just about get away with one but more than that, no chance. Well, the question is beginning to become moot with the change to the templates - I've downloaded a replacement task since I posted, and confirmed that the files (in batch 887, this time) have a safe size limit. That limit is 210,000,000 bytes (200.27 MiB), which is still a bit close - I'd have preferred it if they'd gone up to 250 MB/MiB (either would do!). Your problem might be helped if you staggered the task starts. I always have CPU tasks from other projects running alongside CPDN tasks: under those conditions, BOINC only ever allocates / downloads one task at a time. CPDN has set the server timeout so that a second request won't be made until a full hour has passed: if you can upload the problem files in less than an hour, that interval should be enough. [On cue, I can see my second replacement task - also 887 - downloading as I type] During times of drought, you could let some lightweight project run as a resource share zero backup project. Then, when a new batch is released, each core should transition back to CPDN at hourly intervals. If the batch has uniform runtimes - they usually do - the finishes ought to follow that hourly interval as well. Small additional bonus - keeping the cores warm while you're waiting should reduce wear and tear from thermal cycling! |
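The size figures in the post above are easy to check. This is just the unit arithmetic as a small sketch (function names invented for the example): decimal megabytes are 10^6 bytes and binary mebibytes are 2^20 bytes.

```python
LIMIT_BYTES = 210_000_000  # the new max_nbytes value quoted in the post

def to_mb(n_bytes):
    """Decimal megabytes (1 MB = 10**6 bytes)."""
    return n_bytes / 10**6

def to_mib(n_bytes):
    """Binary mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20

print(f"{LIMIT_BYTES} bytes = {to_mb(LIMIT_BYTES):.2f} MB = {to_mib(LIMIT_BYTES):.2f} MiB")
print(f"250 MB  = {250 * 10**6} bytes")
print(f"250 MiB = {250 * 2**20} bytes")
```

So the 210,000,000-byte limit leaves roughly 10 MB of headroom over a 200 MB file, while either of the suggested 250 MB or 250 MiB limits would leave 50 MB or more.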
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine. Just uploaded a 4.zip from one of my tasks - 50 minutes. Other half was on a Zoom call during the upload, but that shouldn't have made that much difference. My old laptop would probably take so long for the bit after the zip is created that it would be fine, but on the Ryzen it wouldn't stand a chance. |
©2024 cpdn.org