computation error at 100% complete

Author	Message
nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62809 - Posted: 1 Nov 2020, 17:15:27 UTC So I left the machine with 15 mins to go before completion on task hadam4h_b0tg_201211_5_877_012029238_0. When I returned it had failed with computation error. It says that the w/u returned 5 trickles. There was a street wide power failure a couple of weeks ago which shut the machine(s) down but all w/u's resumed ok. I have 3 more due to finish in the next day. It would be a shame if all failed the same way. I am beginning to suspect a shortage of disk space on this machine....... is this what the stderr_txt is actually trying to say?? Ta Nairb <core_client_version>7.16.6</core_client_version> <![CDATA[ <stderr_txt> 16:13:32 (1925): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_2.zip</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> <file_xfer_error> <file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_5.zip</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> </message ID: 62809 · Reply Quote

WB8ILI Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164	Message 62810 - Posted: 1 Nov 2020, 18:20:39 UTC Last modified: 1 Nov 2020, 18:24:25 UTC I have had the same thing with the "file size too big" message. Maybe it is semi-normal. I am 99.999% sure it is NOT a disk space issue. ID: 62810 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62811 - Posted: 1 Nov 2020, 19:23:23 UTC Last modified: 1 Nov 2020, 19:50:03 UTC I am 99.999% sure it is NOT a disk space issue. I will let the project know. I am almost certain it is a setting in one of the files downloaded for the task that needs increasing. If I can work out which one it is I can post a fix for tasks still running for those who don't mind getting their hands dirty. If all five zips have gone, you will get all the credit, but it would be good to stop the tasks going out again only for the same error to occur at the end. Hopefully the fix can be applied before the 2072 batch for this series goes out. Edit: Using a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml and, for all the batch 877 and 878 tasks on my machine, found <rsc_disk_bound> and doubled the value from 2000000000.000000 to 4000000000.000000. My tasks which might be affected have only just started, so it will be about a week until I know whether it works or not. In the meantime I will let Andy know that this is a problem. ID: 62811 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62812 - Posted: 1 Nov 2020, 19:31:02 UTC I lost 3 from the same batch a week ago, so it looks like a bad batch. :( ID: 62812 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 62813 - Posted: 1 Nov 2020, 19:50:54 UTC Well, this is concerning. Three of my batch 878 tasks have trickled once/uploaded a zip file, and in the upload section, max_nbytes = 150000000 (150 MB) and the uploaded _1.zip files were over 200 MB. ID: 62813 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62814 - Posted: 1 Nov 2020, 22:36:33 UTC Last modified: 1 Nov 2020, 22:37:35 UTC On a separate machine I have just had the same thing happen to hadam4h_b0cw_200811_5_877_012028642_0 <core_client_version>7.16.6</core_client_version> <![CDATA[ <stderr_txt> 22:21:59 (31404): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>hadam4h_b0cw_200811_5_877_012028642_0_r803748994_5.zip</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> </message> ]]> This machine had 11.6 GB free disk space. The final zip file was 193.11 MB. I did not get time to check the <rsc_disk_bound> value. ID: 62814 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62815 - Posted: 1 Nov 2020, 22:38:00 UTC - in response to Message 62813. Well, this is concerning. Three of my batch 878 tasks have trickled once/uploaded a zip file, and in the upload section, max_nbytes = 150000000 (150 MB) and the uploaded _1.zip files were over 200 MB. I will try increasing max_nbytes as well. But my 877's are all resends that have just started a few hours ago. ID: 62815 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62816 - Posted: 1 Nov 2020, 23:22:21 UTC - in response to Message 62811. Edit: Using a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml and, for all the batch 877 and 878 tasks on my machine, found <rsc_disk_bound> and doubled the value from 2000000000.000000 to 4000000000.000000. My tasks which might be affected have only just started, so it will be about a week until I know whether it works or not. In the meantime I will let Andy know that this is a problem. OK, it's been done on the 2 machines. There are 3 877's to complete in the next 20 hrs. I do hope they are more successful. ID: 62816 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,946,057 RAC: 13,930	Message 62817 - Posted: 1 Nov 2020, 23:33:10 UTC - in response to Message 62812. I had 2 from batch 877 that have finished OK with no errors. A third one is on the go along with 2 from 878. They finish in about 12 days' time! ID: 62817 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62818 - Posted: 2 Nov 2020, 1:36:28 UTC - in response to Message 62817. That's good to hear. I'll pass that along too. This batch must have some members that are set to produce just enough extra data to cause problems. ID: 62818 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275	Message 62819 - Posted: 2 Nov 2020, 3:00:07 UTC - in response to Message 62818. I had 10 out of 10 batch 877 tasks finish successfully. ID: 62819 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62820 - Posted: 2 Nov 2020, 9:40:11 UTC - in response to Message 62816. Last modified: 2 Nov 2020, 10:06:49 UTC It looks like it is max_nbytes rather than the other one. I am not sure whether the value needs to be increased for zips 1-5, out.zip and restart.zip, which would mean 7 changes for each task. Not sure I want to do that by hand for 11 tasks: one mistake could potentially knock out a lot of work. Edit: using find and replace to change <max_nbytes>150000000.000000</max_nbytes> to <max_nbytes>300000000.000000</max_nbytes> and hitting replace all should sort it. The project are thinking about different options, i.e. just changing things on tasks still to go out, aborting tasks in progress and resending rather than wasting crunching time, etc. ID: 62820 · Reply Quote
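For anyone who would rather script the replace-all than do it in an editor, here is a minimal sketch in Python. It is not an official CPDN or BOINC tool. The function names and the `.bak` backup are illustrative choices, and the path to client_state.xml is left for you to supply. It assumes the BOINC client has been fully exited first, and like replace-all it changes every matching <max_nbytes> entry in the file, not only those for batches 877 and 878. Use at your own risk.

```python
import shutil
from pathlib import Path

# Values taken from the post above; adjust if your entries differ.
OLD_LIMIT = "150000000.000000"
NEW_LIMIT = "300000000.000000"


def raise_max_nbytes(text, old=OLD_LIMIT, new=NEW_LIMIT):
    """Return (patched_text, count) after replacing every exact
    <max_nbytes>old</max_nbytes> entry with the new value."""
    old_tag = "<max_nbytes>" + old + "</max_nbytes>"
    new_tag = "<max_nbytes>" + new + "</max_nbytes>"
    count = text.count(old_tag)
    return text.replace(old_tag, new_tag), count


def patch_state_file(path):
    """Back up the given file next to itself, then patch it in place.
    Returns the number of entries changed."""
    path = Path(path)
    # Hypothetical safety step: keep a copy in case something goes wrong.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    patched, count = raise_max_nbytes(path.read_text(encoding="utf-8"))
    if count:
        path.write_text(patched, encoding="utf-8")
    return count
```

Calling `patch_state_file(...)` with the location of your own client_state.xml returns how many entries were changed. A count of zero means nothing matched and the file was left alone.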

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,691,690 RAC: 10,582	Message 62821 - Posted: 2 Nov 2020, 11:57:45 UTC The individual file sizes, and the total disk usage, are different measures, and are treated differently. From the BOINC code, talking about the individual files: // Note: this is only checked when the application finishes. // The total disk space is checked while the application is running. So, the intermediate result files can upload without problems while the app is running, but any left over at the end may cause it to fail with ERR_FILE_TOO_BIG. The intermediate files are certainly too big for the current run: <file> <name>hadam4h_c0ap_206511_5_878_012030243_0_r1407367359_1.zip</name> <nbytes>202200658.000000</nbytes> <max_nbytes>150000000.000000</max_nbytes> <md5_cksum>9a24fd25f8124d69316920a39abb2f31</md5_cksum> <status>0</status> <uploaded/> <upload_url>http://upload3.cpdn.org/cgi-bin/file_upload_handler</upload_url> </file> but as you can see, that one went through OK. Towards the end of the run, the app will create a fifth intermediate zip file. If you can, it would be wise to allow that one to finish uploading before the task completes. Two more files are created at the very end - a restart.zip and an out.zip. I believe those files are significantly smaller, and should cause no problems at all. ID: 62821 · Reply Quote
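To see which files are already over their limit, the comparison described above (<nbytes> against <max_nbytes> within each <file> entry) can be sketched in Python. This only reads text in the layout shown in the post. It uses simple pattern matching rather than a full XML parser, the function names are made up for illustration, and it says nothing about how BOINC itself performs the check.

```python
import re

# Matches each <file> ... </file> entry as laid out in the post above.
FILE_BLOCK = re.compile(r"<file>(.*?)</file>", re.DOTALL)


def _field(block, tag):
    """Return the text inside <tag>...</tag> in this block, or None."""
    match = re.search(r"<" + tag + r">([^<]*)</" + tag + r">", block)
    return match.group(1).strip() if match else None


def oversize_files(state_text):
    """List (name, nbytes, max_nbytes) for every file entry whose
    recorded size exceeds its max_nbytes limit."""
    results = []
    for block in FILE_BLOCK.findall(state_text):
        name = _field(block, "name")
        nbytes = _field(block, "nbytes")
        limit = _field(block, "max_nbytes")
        if name is None or nbytes is None or limit is None:
            continue  # entry lacks one of the fields we compare
        size, cap = float(nbytes), float(limit)
        if size > cap:
            results.append((name, size, cap))
    return results
```

Run against the contents of client_state.xml, any file listed that has not yet uploaded is a candidate for the -131 error when the task finishes.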

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62822 - Posted: 2 Nov 2020, 12:02:38 UTC - in response to Message 62820. Edit: using find and replace to change <max_nbytes>150000000.000000</max_nbytes> to <max_nbytes>300000000.000000</max_nbytes> and hitting replace all should sort it. It's been done..... it does seem that only some w/u are affected. Better to find out it's not a machine problem tho. Ta Nairb ID: 62822 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62825 - Posted: 2 Nov 2020, 14:23:59 UTC Last modified: 2 Nov 2020, 15:20:36 UTC All tasks waiting to go have been withdrawn. Edit: just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted. ID: 62825 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 62826 - Posted: 2 Nov 2020, 15:41:13 UTC - in response to Message 62825. Last modified: 2 Nov 2020, 15:58:03 UTC Edit: just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted. Good decision (for me at any rate). I have updated all three of my machines before the first trickle. Only one was running 878, and the other two 879 (if it is affected). They should be good to go. ID: 62826 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62829 - Posted: 2 Nov 2020, 19:27:02 UTC Well, the fix had been applied to the client_state.xml: both the max_nbytes and rsc_disk_bound values, on all the lines that needed changing. Maybe I made an error, but the next w/u to finish just now also failed with.... <core_client_version>7.16.6</core_client_version> <![CDATA[ <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... 19:06:44 (1926): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>hadam4h_b0ft_200911_5_877_012028747_0_r1809679561_5.zip</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> </message> ]]> There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch. Little point in waiting 20(ish) days to find they fail as well...... ID: 62829 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,970,055 RAC: 21,846	Message 62830 - Posted: 2 Nov 2020, 20:07:49 UTC - in response to Message 62829. There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch. Little point in waiting 20(ish) days to find they fail as well.. Doubling the size is more than the increase being applied before they are re-released. I don't know enough to work out whether there is a difference between changing things in client_state.xml and changing the files that get sent from the server. Logically I can't see why there should be a difference, but I have not programmed since the days of ALGOL..... Given that 14% have completed even though they show errors, it looks like something else is going on. It is another 8 days till mine finish, even on a Ryzen 7. It may be a stupid question, but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client. ID: 62830 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 62831 - Posted: 2 Nov 2020, 20:52:12 UTC - in response to Message 62829. I think that the thing about this is, if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties. Which doesn't matter so much. ID: 62831 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 62832 - Posted: 2 Nov 2020, 20:53:14 UTC - in response to Message 62830. It may be a stupid question, but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client. No, no, good question. I didn't think to quit the client. So I just checked and the client_state.xml was back to how it was without the changes. So I quit the client................. made the changes again, and restarted. 2 other w/u errored straight away and died. They had only!! been running a day or so. The w/u that is due to complete soon restarted OK. I rechecked the client_state.xml and the changes were still there. So I will have to redo the changes on 2 other machines again................ Every time I suspend a job or restart BOINC I seem to lose a w/u or 2. I now just pull the power cord out. I will report the outcome of the remaining 877 with 1 hr 10 mins left to run. The 2 failed w/u after the restart showed hadam4h_c0ds_206511_5_878_012030354_0 <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> process got signal 65</message> <stderr_txt> Signal 2 received: Interrupt Signal 2 received: Illegal instruction - invalid function image Signal 2 received: Floating point exception Signal 2 received: Segment violation Signal 2 received: Software termination signal from kill Signal 2 received: Abnormal termination triggered by abort call Signal 2 received, exiting... 20:14:24 (1310): called boinc_finish(193) Signal 2 received: Interrupt Signal 2 received: Illegal instruction - invalid function image Signal 2 received: Floating point exception Signal 2 received: Segment violation Signal 2 received: Software termination signal from kill Signal 2 received: Abnormal termination triggered by abort call Signal 2 received, exiting... 
20:14:25 (983): called boinc_finish(193) </stderr_txt> along with loads of messages like 02-Nov-2020 20:32:07 [climateprediction.net] Output file hadam4h_c0ds_206511_5_878_012030354_0_r381159273_4.zip for task hadam4h_c0ds_206511_5_878_012030354_0 absent ID: 62832 · Reply Quote