climateprediction.net (CPDN) home page
Thread 'computation error at 100% complete'

Questions and Answers : Unix/Linux : computation error at 100% complete

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62834 - Posted: 2 Nov 2020, 21:08:56 UTC

So I will have to redo the changes on 2 other machines again... every time I suspend a job or restart BOINC I seem to lose a w/u or 2.

I now just pull the power cord out.


My experience is that stopping by pulling the power cord is more likely to cause models to crash. I don't think I have ever had a CPDN task crash when just suspending. My way of doing it is: 1. Suspend computing globally in BOINC. 2. Suspend each task individually. 3. Wait at least two minutes, then exit. I then run top just to check the client really has exited, in case I have been careless and ticked the wrong thing in the manager's exit dialogue. (This may not be present in your version of BOINC.) I seem to have been lucky with my new toy and haven't yet lost any tasks when stopping and starting. This, I think, lends credence to the theory that exiting while BOINC is still doing a disk write may cause some of the problems: the new toy has an NVMe SSD. The routine I follow is more or less as suggested by one of my fellow moderators back in the days when this was a much bigger problem.

Lesson for me is - don't assume that anyone wanting to follow my instructions will know the bits I left out!
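
For anyone who would rather script the routine above, here is a rough sketch. It assumes that boinccmd (the command-line tool shipped with BOINC) is on the path and can authenticate to the local client, that the client process is called "boinc" (it may be named differently on your distribution), and that you pass in the project URL and task names as listed by "boinccmd --get_tasks". The function names are made up for illustration, and the 10 second pause after quitting is an arbitrary figure.

```python
import subprocess
import time


def build_suspend_commands(project_url, task_names):
    """Return the boinccmd calls for steps 1 and 2, in order."""
    # Step 1: suspend computing globally.
    commands = [["boinccmd", "--set_run_mode", "never"]]
    # Step 2: suspend each task individually.
    for name in task_names:
        commands.append(["boinccmd", "--task", project_url, name, "suspend"])
    return commands


def stop_client(project_url, task_names, settle_seconds=120, process_name="boinc"):
    """Suspend, wait, quit, then report whether the client has really gone."""
    for command in build_suspend_commands(project_url, task_names):
        subprocess.run(command, check=True)
    time.sleep(settle_seconds)  # Step 3: wait at least two minutes.
    subprocess.run(["boinccmd", "--quit"], check=True)
    time.sleep(10)  # Give the client a moment to shut down (arbitrary figure).
    # pgrep exits with status 0 if a process with exactly this name still exists.
    check = subprocess.run(["pgrep", "-x", process_name], capture_output=True)
    return check.returncode != 0  # True means the client has exited.
```

Bear in mind that a run mode set this way should persist across a restart, so computing and the individual tasks have to be resumed by hand afterwards (for example with "boinccmd --set_run_mode auto").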
ID: 62834
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62835 - Posted: 2 Nov 2020, 21:10:37 UTC - in response to Message 62831.  

I think that the thing about this is, if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties.
Which doesn't matter so much.


So, unless I have a following wind with my bored band lack of speed, I should suspend computation at that point.
ID: 62835
nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62836 - Posted: 2 Nov 2020, 21:43:38 UTC

Well, well, well, it's a success. And it was still uploading the last 192 meg file when it reached 100%.

So I will try the new method of stopping the processing of w/u. I had worked with computers for endless years and always hated using the on/off switch to solve issues. But power cuts never seem to kill a climate w/u. Just luck I guess.

Good idea to use top to check if the process really has cleared off.
Let's hope the re-issued w/u work better when they arrive.
ID: 62836
nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62837 - Posted: 2 Nov 2020, 23:25:57 UTC

So, I used the new method of suspending these w/u before stopping the client, then checking to make sure they had gone. It seems to have worked for 1 machine: all 4 w/u restarted and so far are still running.
On the second machine I applied the same procedure, and both w/u failed with a computation error when restarted.

Maybe it's a Fedora 30 thing... That's some 50 days of processing lost/wasted today.
ID: 62837
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62838 - Posted: 2 Nov 2020, 23:32:33 UTC

I'm going with the "little Bo Peep" method:
Leave them alone
And they'll come home
Wagging their tails behind them.
ID: 62838
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62840 - Posted: 3 Nov 2020, 6:31:55 UTC

Batch #882 tasks are the resends of batch #877, with the file size limit increased.
ID: 62840
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 62848 - Posted: 4 Nov 2020, 19:44:16 UTC

Ok,
I have 4 WUs that have progressed to between 20 and 75%, so some zips have already uploaded. Should I change all instances of
<max_nbytes>150000000.000000</max_nbytes> for each WU, or only those for the remaining zips?

There are also other files that have <max_nbytes>0.000000</max_nbytes>, so I guess I should not alter these?
ID: 62848
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62849 - Posted: 4 Nov 2020, 20:07:33 UTC - in response to Message 62848.  

If you have a fast internet connection, I wouldn't worry about it.
The last zip will upload before the program gets to the problem part right at the end of "extra time".
ID: 62849
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62850 - Posted: 4 Nov 2020, 20:29:00 UTC

On the other hand, I have a slow internet connection, so in a text editor I did a search and replace for all instances of <max_nbytes>150000000.000000</max_nbytes> and doubled the value, replacing them with <max_nbytes>300000000.000000</max_nbytes>.

There is a risk with stopping and starting tasks that some might crash. (I haven't had any do so yet on this machine, but fingers crossed!) But if you don't exit the client before editing, it will write the old values back in, which I knew but forgot about the first time I did it. So the procedure is not totally without risk. I am running five tasks concurrently at the moment. I have still to sit down and scientifically test how throughput varies with the number of tasks running, given their high demand on CPU cache memory.

So, as Les says, if your internet connection is fast enough, you are probably better off leaving them. I certainly think it is worth completing them, as some are likely to be hard failures given the number of machines without the 32-bit libs.
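
The search and replace can also be done with a few lines of Python instead of a text editor. This is only a sketch: it assumes the limits live in client_state.xml in the BOINC data directory, and the function names and the ".bak" backup copy are my own additions. It matches the exact 150000000.000000 string, so entries such as <max_nbytes>0.000000</max_nbytes> are left untouched. Run it only while the client is fully stopped, for the reason given above.

```python
from pathlib import Path

OLD_LIMIT = "<max_nbytes>150000000.000000</max_nbytes>"
NEW_LIMIT = "<max_nbytes>300000000.000000</max_nbytes>"


def raise_upload_limit(state_text):
    """Return the edited text and the number of limits that were changed."""
    count = state_text.count(OLD_LIMIT)
    return state_text.replace(OLD_LIMIT, NEW_LIMIT), count


def patch_state_file(path):
    """Back up the state file, then rewrite it with the doubled limit."""
    state_file = Path(path)
    original = state_file.read_text()
    edited, count = raise_upload_limit(original)
    # Keep a copy of the untouched file next to it, e.g. client_state.xml.bak.
    state_file.with_suffix(state_file.suffix + ".bak").write_text(original)
    state_file.write_text(edited)
    return count
```

A call would look like patch_state_file("client_state.xml") from inside the data directory; the return value says how many limits were changed, which is a quick check that the pattern matched at all.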
ID: 62850
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,995,778
RAC: 14,325
Message 62851 - Posted: 4 Nov 2020, 23:48:59 UTC - in response to Message 62849.  

If you have a fast internet connection, I wouldn't worry about it.
The last zip will upload before the program gets to the problem part right at the end of "extra time".

This could be why my two were OK, as I am on high-speed fibre (apart from the first 50 metres). I'll let the others I have run.
ID: 62851
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 62856 - Posted: 5 Nov 2020, 13:32:52 UTC - in response to Message 62850.  

I changed the values following the process suggested and restarted the client. So far all 4 WUs are running OK, and 1 zip has uploaded. I have one more task on another machine, but since I have relatively high-speed internet I will risk it with this one.
ID: 62856
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 62899 - Posted: 8 Nov 2020, 10:11:25 UTC

As a postscript, my two batch 878 tasks were safely delivered au naturel overnight, without any manual intervention. Here's the sequence of events, with timings, for others to base their decisions on.

08/11/2020 05:54:37 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip
08/11/2020 05:54:39 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_restart.zip
08/11/2020 05:54:42 | climateprediction.net | Sending scheduler request: To send trickle-up message.
08/11/2020 05:54:59 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 05:56:47 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_5.zip
08/11/2020 06:16:44 | climateprediction.net | Computation for task hadam4h_c0yb_206511_5_878_012031093_0 finished
08/11/2020 06:16:46 | climateprediction.net | Started upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip
08/11/2020 06:16:49 | climateprediction.net | Finished upload of hadam4h_c0yb_206511_5_878_012031093_0_r357715560_out.zip
08/11/2020 06:55:24 | climateprediction.net | Sending scheduler request: To report completed tasks.
08/11/2020 06:55:26 | climateprediction.net | [sched_op] handle_scheduler_reply(): got ack for task hadam4h_c0yb_206511_5_878_012031093_0
First, the restart file, taking just two seconds. That's never going to be a problem.
Then, the big _5.zip that we were worried about (~200 MB). On my fast line, it took less than two minutes.
Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine.

If your balance (two minute upload versus 20 minute finishing stretch) is substantially different, precautions might be wise. But otherwise, just relax.
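
To work out the same balance from your own event log, something like the following may help. It is a small sketch using the timestamps quoted above, and it assumes the log prints dates as day/month/year; the helper name is just for illustration.

```python
from datetime import datetime

LOG_FORMAT = "%d/%m/%Y %H:%M:%S"  # Assumed day/month/year, as in the log above.


def seconds_between(start, end):
    """Whole seconds elapsed between two event log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return int(delta.total_seconds())


# Upload of the big _5.zip: "Started upload" to "Finished upload".
upload_seconds = seconds_between("08/11/2020 05:54:59", "08/11/2020 05:56:47")
# Window available: "Started upload" to "Computation ... finished".
window_seconds = seconds_between("08/11/2020 05:54:59", "08/11/2020 06:16:44")
# Spare time left after the upload completed.
margin_seconds = window_seconds - upload_seconds

print(upload_seconds, window_seconds, margin_seconds)  # 108 1305 1197
```

So the upload took 108 seconds out of a window of 1305 seconds, leaving a margin of 1197 seconds (just under 20 minutes). If your own upload time comes anywhere near your window, the margin shrinks towards zero and precautions make sense.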
ID: 62899
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62900 - Posted: 8 Nov 2020, 11:34:30 UTC - in response to Message 62899.  

Thanks for that Richard. The real problem for me is if I have two or more tasks finishing about the same time. I will just about get away with one but more than that, no chance.
ID: 62900
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 62901 - Posted: 8 Nov 2020, 14:04:11 UTC - in response to Message 62900.  

Thanks for that Richard. The real problem for me is if I have two or more tasks finishing about the same time. I will just about get away with one but more than that, no chance.
Well, the question is beginning to become moot with the change to the templates - I've downloaded a replacement task since I posted, and confirmed that the files (in batch 887, this time) have a safe size limit. That limit is 210,000,000 bytes (200.27 MB in binary), which is still a bit close - I'd have preferred it if they'd gone up to 250 MB/MiB (either would do!).
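
For anyone wanting to check the arithmetic, here is a quick sketch of the decimal versus binary conversion. The 150,000,000 and 210,000,000 byte limits are the ones quoted in this thread; the helper name is made up.

```python
MIB = 2 ** 20  # One binary megabyte (MiB) is 1,048,576 bytes.


def bytes_to_mib(n_bytes):
    """Convert a byte count to binary megabytes (MiB)."""
    return n_bytes / MIB


old_limit_mib = round(bytes_to_mib(150_000_000), 2)  # Original template limit.
new_limit_mib = round(bytes_to_mib(210_000_000), 2)  # Limit seen in batch 887.
preferred_mb_bytes = 250 * 10 ** 6   # 250 MB in decimal units.
preferred_mib_bytes = 250 * MIB      # 250 MiB in binary units.

print(old_limit_mib, new_limit_mib)             # 143.05 200.27
print(preferred_mb_bytes, preferred_mib_bytes)  # 250000000 262144000
```

Either reading of "250" would leave tens of millions of bytes more headroom than the 210,000,000 limit does.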

Your problem might be helped if you staggered the task starts. I always have CPU tasks from other projects running alongside CPDN tasks: under those conditions, BOINC only ever allocates / downloads one task at a time. CPDN has set the server timeout so that a second request won't be made until a full hour has passed: if you can upload the problem files in less than an hour, that interval should be enough.

[On cue, I can see my second replacement task - also 887 - downloading as I type]

During times of drought, you could let some lightweight project run as a resource share zero backup project. Then, when a new batch is released, each core should transition back to CPDN at hourly intervals. If the batch has uniform runtimes - they usually do - the finishes ought to follow that hourly interval as well.

Small additional bonus - keeping the cores warm while you're waiting should reduce wear and tear from thermal cycling!
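
To see why the hourly stagger helps, here is a toy model. It is a sketch under simplifying assumptions that are mine, not BOINC's actual behaviour: uploads go one at a time over a single link, every upload takes the same time, and the 50 minute figure is purely illustrative for a slow line.

```python
def upload_finish_times(ready_minutes, upload_minutes):
    """Finish time of each upload when the link handles one file at a time."""
    finishes = []
    link_free_at = 0
    for ready in sorted(ready_minutes):
        start = max(ready, link_free_at)  # Wait if the link is still busy.
        link_free_at = start + upload_minutes
        finishes.append(link_free_at)
    return finishes


def worst_delay(ready_minutes, upload_minutes):
    """Longest time any file spends between becoming ready and being uploaded."""
    finishes = upload_finish_times(ready_minutes, upload_minutes)
    return max(done - ready for done, ready in zip(finishes, sorted(ready_minutes)))


# Three tasks producing their big zip an hour apart, 50 minutes per upload.
print(worst_delay([0, 60, 120], 50))  # 50
# The same three tasks all producing it at once.
print(worst_delay([0, 0, 0], 50))     # 150
```

In this model the staggered case never waits longer than a single upload, whereas the last file in the simultaneous case waits three times as long.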
ID: 62901
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,024,725
RAC: 20,592
Message 62924 - Posted: 11 Nov 2020, 10:47:58 UTC

Then, there's nearly 20 minutes' grace before the task is declared 'finished': that's the point where any problems would occur. For comparison, the completed run took 196 hours (just over 8 days) on this machine.


Just uploaded a 4.zip from one of my tasks: 50 minutes. My other half was on a Zoom call during the upload, but that shouldn't have made that much difference. My old laptop would probably take so long over the bit after the zip is created that it would be fine, but on the Ryzen it wouldn't stand a chance.
ID: 62924