Questions and Answers : Unix/Linux : computation error at 100% complete
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

So I left the machine with 15 mins to go before completion on task hadam4h_b0tg_201211_5_877_012029238_0. When I returned it had failed with a computation error. It says that the w/u returned 5 trickles. There was a street-wide power failure a couple of weeks ago which shut the machine(s) down, but all w/u's resumed OK. I have 3 more due to finish in the next day; it would be a shame if they all failed the same way. I am beginning to suspect a shortage of disk space on this machine... is this what the stderr_txt is actually trying to say?

Ta
Nairb

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
16:13:32 (1925): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
    <file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_2.zip</file_name>
    <error_code>-131 (file size too big)</error_code>
</file_xfer_error>
<file_xfer_error>
    <file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_5.zip</file_name>
    <error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>
Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164

I have had the same thing with the "file size too big" message. Maybe it is semi-normal. I am 99.999% sure it is NOT a disk space issue.
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592

"I am 99.999% sure it is NOT a disk space issue."

I will let the project know. I am almost certain it is a setting in one of the files downloaded for the task that needs increasing. If I can work out which one it is, I can post a fix for tasks still running, for those who don't mind getting their hands dirty. If all five zips have gone you will get all the credit, but it would be good to stop the tasks going out again only for the same error to occur at the end. Hopefully the fix can be applied before the 2072 batch for this series goes out.

Edit: With a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml, looked for <rsc_disk_bound> in all of the batch 877 and 878 tasks on my machine, and doubled the value from 2000000000.000000 to 4000000000.000000. My tasks which might be affected have only just started, so it will be about a week till I know whether it works or not. In the meantime I will let Andy know that this is a problem.
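For anyone who would rather script that edit than do it by hand, here is a minimal sketch of the same idea in Python. It is only an illustration, not the exact method described above: it assumes the BOINC client has been exited first, that client_state.xml lives at the usual Linux path (adjust for your install), and it doubles every <rsc_disk_bound> in the file rather than only the batch 877/878 tasks.

```
import re
import shutil

# Illustrative sketch only: double every <rsc_disk_bound> value in client_state.xml.
# Exit the BOINC client before running this, or the running client will undo the change.
STATE = "/var/lib/boinc-client/client_state.xml"  # assumed path - adjust for your install

shutil.copy(STATE, STATE + ".bak")  # keep a backup before touching anything

with open(STATE) as f:
    text = f.read()

def doubled(match):
    # e.g. 2000000000.000000 -> 4000000000.000000
    return "<rsc_disk_bound>%.6f</rsc_disk_bound>" % (float(match.group(1)) * 2)

text = re.sub(r"<rsc_disk_bound>([\d.]+)</rsc_disk_bound>", doubled, text)

with open(STATE, "w") as f:
    f.write(text)
```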
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0

I lost 3 from the same batch a week ago, so it looks like a bad batch. :(
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275

Well, this is concerning. Three of my batch 878 tasks have trickled once/uploaded a zip file; in the upload section, max_nbytes = 150000000 (150 MB), and the uploaded _1.zip files were over 200 MB.
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

On a separate machine I have just had the same thing happen to hadam4h_b0cw_200811_5_877_012028642_0:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
22:21:59 (31404): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
    <file_name>hadam4h_b0cw_200811_5_877_012028642_0_r803748994_5.zip</file_name>
    <error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

This machine had 11.6 GB of free disk space. The final zip file was 193.11 MB. I did not get time to check the <rsc_disk_bound> value.
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592

"Well, this is concerning."

I will try increasing max_nbytes as well. But my 877's are all resends that have only just started a few hours ago.
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

OK, it's been done on the 2 machines. There are 3 877's to complete in the next 20 hrs. I do hope they are more successful.
Joined: 22 Feb 06 Posts: 491 Credit: 30,995,778 RAC: 14,325

I had 2 from batch 877 that have finished OK with no errors. The third one is on the go, along with 2 from 878. They finish in about 12 days' time!
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0

That's good to hear. I'll pass that along too. This batch must have some members that are set to produce just enough extra data to cause problems.
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275

I had 10 out of 10 batch 877 tasks finish successfully.
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592

Looks like it is max_nbytes rather than the other one. I am not sure whether the value needs to be increased for zips 1-5, out.zip and restart.zip, which would mean 7 changes for each task. Not sure I want to do that for 11 tasks - one mistake could potentially knock out a lot of work.

Edit: Find and replace
<max_nbytes>150000000.000000</max_nbytes>
with
<max_nbytes>300000000.000000</max_nbytes>
and hit replace-all; that should sort it.

The project are thinking about different options, i.e. just changing things on tasks still to go out, aborting tasks in progress and resending rather than wasting crunching time, etc.
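A scripted version of that replace-all, again only a sketch under the same assumptions (BOINC client exited first, standard Linux path for client_state.xml), not the exact steps used:

```
# Illustrative sketch of the find-and-replace described above: raise max_nbytes
# from 150 MB to 300 MB for every upload file listed in client_state.xml.
# Exit the BOINC client first, otherwise it will overwrite the edit on shutdown.
STATE = "/var/lib/boinc-client/client_state.xml"  # assumed path - adjust for your install

OLD = "<max_nbytes>150000000.000000</max_nbytes>"
NEW = "<max_nbytes>300000000.000000</max_nbytes>"

with open(STATE) as f:
    text = f.read()

with open(STATE, "w") as f:
    f.write(text.replace(OLD, NEW))
```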
Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361

The individual file sizes, and the total disk usage, are different measures, and are treated differently. From the BOINC code, talking about the individual files:

// Note: this is only checked when the application finishes.
// The total disk space is checked while the application is running.

So, the intermediate result files can upload without problems while the app is running, but any left over at the end may cause it to fail with ERR_FILE_TOO_BIG. The intermediate files are certainly too big for the current run:

<file>
    <name>hadam4h_c0ap_206511_5_878_012030243_0_r1407367359_1.zip</name>
    <nbytes>202200658.000000</nbytes>
    <max_nbytes>150000000.000000</max_nbytes>
    <md5_cksum>9a24fd25f8124d69316920a39abb2f31</md5_cksum>
    <status>0</status>
    <uploaded/>
    <upload_url>http://upload3.cpdn.org/cgi-bin/file_upload_handler</upload_url>
</file>

but as you can see, that one went through OK. Towards the end of the run, the app will create a fifth intermediate zip file. If you can, it would be wise to allow that one to finish uploading before the task completes. Two more files are created at the very end - a restart.zip and an out.zip. I believe those files are significantly smaller, and should cause no problems at all.
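Put another way - this is a rough paraphrase of the behaviour described above, not the actual BOINC source - the per-file limit only bites for output files still waiting to upload when the application finishes, which is why an oversized _1.zip that uploaded mid-run went through while an oversized _5.zip left over at the end fails the task:

```
# Rough paraphrase of the behaviour described above - NOT the real BOINC code.
ERR_FILE_TOO_BIG = -131  # the error code seen throughout this thread

def check_outputs_at_finish(output_files):
    """Per-file size limits are only enforced once the application finishes."""
    for f in output_files:
        if not f["uploaded"] and f["nbytes"] > f["max_nbytes"]:
            return ERR_FILE_TOO_BIG  # e.g. a 200 MB _5.zip against a 150 MB limit
    return 0  # files that already uploaded mid-run never hit this check
```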
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

"Edit: Find and replace <max_nbytes>150000000.000000</max_nbytes> with <max_nbytes>300000000.000000</max_nbytes> and hit replace-all; that should sort it."

It's been done... it does seem that only some w/u are affected. Better to find out it's not a machine problem, though.

Ta
Nairb
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592

All tasks waiting to go have been withdrawn.

Edit: I have just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted.
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653

"Edit: I have just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted."

Good decision (for me at any rate). I have updated all three of my machines before the first trickle. Only one was running 878, and the other two 879 (if it is affected). They should be good to go.
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

Well, the fix had been applied to client_state.xml - both max_nbytes and rsc_disk_bound, all the lines that needed changing. Maybe I made an error, but the next w/u to finish just now also failed with:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
19:06:44 (1926): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
    <file_name>hadam4h_b0ft_200911_5_877_012028747_0_r1809679561_5.zip</file_name>
    <error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch - little point in waiting 20(ish) days to find they fail as well...
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592

"There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch - little point in waiting 20(ish) days to find they fail as well..."

Doubling the size is more than the increase being applied before they are re-released. I don't know enough to work out whether there is a difference between changing things in client_state.xml and the files that get sent from the server. Logically I can't see why there should be a difference, but not having programmed since the days of ALGOL...

Given that 14% have completed even though they show errors, it looks like something else is going on. It is another 8 days till mine finish, even on a Ryzen 7.

It may be a stupid question, but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client.
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0

I think the thing about this is: if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties, which doesn't matter so much.
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785

No, no, good question. I didn't think to quit the client. So I just checked, and the client_state.xml was back without the changes. So I quit the client... made the changes again, and restarted. 2 other w/u's error(ed) straight away and died - they had only(!) been running a day or so. The w/u that is due to complete soon restarted OK. I rechecked the client_state.xml and the changes were still there. So I will have to redo the changes on the 2 other machines again... Every time I suspend a job or restart BOINC I seem to lose a w/u or 2; I now just pull the power cord out. I will report the outcome of the remaining 877 with 1 hr 10 mins left to run.

The 2 failed w/u's after the restart showed:

hadam4h_c0ds_206511_5_878_012030354_0
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 65</message>
<stderr_txt>
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:24 (1310): called boinc_finish(193)
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:25 (983): called boinc_finish(193)
</stderr_txt>

along with loads of messages like:

02-Nov-2020 20:32:07 [climateprediction.net] Output file hadam4h_c0ds_206511_5_878_012030354_0_r381159273_4.zip for task hadam4h_c0ds_206511_5_878_012030354_0 absent