Message boards : Number crunching : Reporting - Errors while computing -
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The server, which is very elderly, is having difficulties. I saw that red error a few hours ago on a couple of pages. I'll email the project, but it's <sigh> the weekend. There may be downtime associated with fixing this. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Presumably the server status page showing blank is part of the same server problem? As you say Les, "It is a weekend." At least it isn't joined on to a public holiday! |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
climateapps2 contains EVERYTHING boinc. Including this forum, so we're lucky at present. |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
I have the "Server can't open log file (../log_climateapps2/scheduler.log)" error too. I haven't seen it before either. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
As you say Les, we are lucky at the moment. I noticed two trickles have not gone, with the message "Internet access OK" etc. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Hi, it seems my workunit got stuck at the 75% mark and, as far as I understand, I should abort it. However, I want to try your suggestions with vm.swappiness etc., but I use the BOINC GUI manager and I'm not sure how to make the changes. Can you give me some advice? Thanks |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"It seems my workunit got stuck at the 75% mark": This is almost certainly one of the ways these models can crash at the decade points. Go into the graphics and see if it is stuck in a loop. If so, the only thing to do is to abort it. You may then have to go into the BOINC data folder and delete the folder for that particular model once it has reported as being aborted, if you want to avoid it taking up space. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Show graphics command (button) never worked. But I can say that the work is stuck on 75.761% and restarts the time at one particular time - this is a loop. So most probably it is the "25%" mark issue. I will follow your suggestion. Can you show me where and how to tune "the sysctls vm.swappiness, vm.dirty_background_ratio, and vm.dirty_ratio."? |
Send message Joined: 17 Sep 04 Posts: 9 Credit: 19,604,231 RAC: 296 |
Dear Sir, every model run on my home computer reports an error. Can anyone suggest what I should do? I've been running since 2004! Steve
Task ID | Work unit ID | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application |
---|---|---|---|---|---|---|---|---|---|
16060662 | 8607938 | 7 Oct 2013 21:10:43 UTC | 7 Jan 2014 4:37:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16060078 | 8498903 | 7 Oct 2013 10:37:43 UTC | 8 Oct 2013 22:21:05 UTC | Error while computing | 73.38 | 0.44 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16060077 | 8553663 | 7 Oct 2013 10:37:43 UTC | 6 Jan 2014 18:04:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16060069 | 8608608 | 7 Oct 2013 10:37:43 UTC | 6 Jan 2014 18:04:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16055638 | 8455826 | 3 Oct 2013 22:36:38 UTC | 8 Oct 2013 22:27:56 UTC | Error while computing | 211,148.33 | 193,029.90 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16046545 | 8626670 | 27 Sep 2013 11:32:06 UTC | 27 Dec 2013 18:59:17 UTC | In progress | --- | --- | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16046297 | 8626424 | 27 Sep 2013 13:15:05 UTC | 8 Oct 2013 22:27:56 UTC | Error while computing | 242,333.37 | 221,644.30 | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16044336 | 8624471 | 27 Sep 2013 10:31:22 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 247,812.48 | 225,412.40 | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16042451 | 8622601 | 30 Sep 2013 21:24:27 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 215,726.37 | 198,771.50 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16040051 | 8620218 | 3 Oct 2013 21:35:51 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 218,947.11 | 203,252.70 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16037500 | 8617683 | 27 Sep 2013 9:30:28 UTC | 3 Oct 2013 21:35:51 UTC | Error while computing | 28,158.34 | 23,817.70 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
16000291 | 8613251 | 2 Sep 2013 10:57:22 UTC | 2 Dec 2013 18:24:33 UTC | In progress | --- | --- | 2,488.32 | 2,488.32 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15998475 | 8613934 | 31 Aug 2013 20:42:38 UTC | 2 Sep 2013 10:57:22 UTC | Error while computing | 412.77 | 296.89 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15998473 | 8469688 | 31 Aug 2013 20:42:38 UTC | 27 Sep 2013 8:00:17 UTC | Error while computing | 7,265.02 | 5,740.59 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15998470 | 8613873 | 31 Aug 2013 20:42:38 UTC | 3 Oct 2013 22:36:38 UTC | Error while computing | 32,718.18 | 27,907.19 | 311.04 | 311.04 | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15998347 | 8613875 | 31 Aug 2013 18:23:35 UTC | 27 Sep 2013 8:00:17 UTC | Error while computing | 2,440.82 | 2,327.43 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15997975 | 8528069 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 20:42:38 UTC | Error while computing | 49.79 | 0.06 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15997974 | 8613961 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15996983 | 8613807 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
15996847 | 8613676 | 31 Aug 2013 14:59:20 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07 |
Send message Joined: 30 Aug 06 Posts: 27 Credit: 1,892,002 RAC: 1,589 |
A while back I changed the various programs on my Windows machines that scan folders (Virus Scans, Indexing) to exclude the WCG directories from the real time scans. I also set BOINC to suspend work when the full disk scans are scheduled. I have had better luck completing models since then. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Steve in Pimlico, many of your tasks have made it quite a long way through before giving up. If you click on the + under STDERR you can see the errors. Those I looked at all showed the same pattern. I suspect the most likely cause is something that puts a lock on one of the files when BOINC is trying to write to it (many antivirus programs do this). The second possibility is something very intensive being run on the computer. It is worth excluding the BOINC data directory from any virus scans, and suspending computation and exiting BOINC before shutting down. Also, under Tools > Disk and memory usage, ensure "Leave applications in memory while suspended" is ticked. I am sure I have missed something out. If so, someone with a better memory than me will add it soon. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I agree with the above two. Just to expand on Dave's point: after you have finished looking at your antivirus's options, if you are looking on the website, the BOINC settings are found in Account / computing options.
* Suspend work while computer is in use? no
* Suspend work if CPU usage is above 0 %
* Leave tasks in memory while suspended? yes (Suspended tasks will consume swap space if 'yes')
Having these three settings means that the task will stay in memory rather than being pushed out & reloaded repeatedly. You have plenty of memory, so it should be fine to keep them in memory. I'm a volunteer and my views are my own. |
Send message Joined: 17 Sep 04 Posts: 9 Credit: 19,604,231 RAC: 296 |
Thanks, will try this. No luck so far. |
Send message Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0 |
I've returned to the project after a bit of a gap. A couple of models have errored out, one annoyingly close to the end - so I'll post a clip of the message log to gauge any opinions, I don't know what all the error codes mean. I do most of the hygiene things anyway, though perhaps I should look at exempting Boinc data from MSE a/virus. One model completed successfully and was stuck in the recent upload blockage and one of the others failed just after everything cleared. Just coincidence? My tasks should be set to visible to see the stderr exit files <which I don't understand!>.
message log, starting with the successful model completing and reporting:
12/11/2013 07:30:19 climateprediction.net Started upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 07:30:21 climateprediction.net [error] Error reported by file upload server: Server is out of disk space
12/11/2013 07:30:21 climateprediction.net Temporarily failed upload of hadcm3n_o1km_1980_40_008401621_1_4.zip: transient upload error
12/11/2013 07:30:21 climateprediction.net Backing off 3 hr 30 min 42 sec on upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 11:01:04 climateprediction.net Started upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 11:04:50 climateprediction.net Finished upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 12:59:28 climateprediction.net task hadcm3n_7x75_1980_40_008454308_3 resumed by user
12/11/2013 13:02:59 climateprediction.net Restarting task hadcm3n_7x75_1980_40_008454308_3 using hadcm3n version 607
12/11/2013 21:39:31 climateprediction.net Sending scheduler request: To send trickle-up message.
12/11/2013 21:39:31 climateprediction.net Reporting 1 completed tasks, not requesting new tasks
12/11/2013 21:39:34 climateprediction.net Scheduler request completed
13/11/2013 01:15:22 climateprediction.net Task hadcm3n_o525_1940_40_008380310_2 exited with zero status but no 'finished' file
13/11/2013 01:15:22 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:15:22 climateprediction.net Task hadcm3n_7x75_1980_40_008454308_3 exited with zero status but no 'finished' file
13/11/2013 01:15:22 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:15:23 climateprediction.net Restarting task hadcm3n_o525_1940_40_008380310_2 using hadcm3n version 607
13/11/2013 01:15:24 climateprediction.net Restarting task hadcm3n_7x75_1980_40_008454308_3 using hadcm3n version 607
13/11/2013 01:16:28 climateprediction.net Task hadcm3n_ofqn_1900_40_008475522_1 exited with zero status but no 'finished' file
13/11/2013 01:16:28 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:16:28 climateprediction.net Restarting task hadcm3n_ofqn_1900_40_008475522_1 using hadcm3n version 607
13/11/2013 01:20:33 climateprediction.net Sending scheduler request: To send trickle-up message.
13/11/2013 01:20:33 climateprediction.net Not reporting or requesting tasks
13/11/2013 01:20:37 climateprediction.net Scheduler request completed
13/11/2013 01:20:51 climateprediction.net Computation for task hadcm3n_ofqn_1900_40_008475522_1 finished
13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_3.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent
13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_4.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent
13/11/2013 09:25:47 climateprediction.net Sending scheduler request: To send trickle-up message.
13/11/2013 09:25:47 climateprediction.net Reporting 1 completed tasks, not requesting new tasks
13/11/2013 09:25:50 climateprediction.net Scheduler request completed |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_3.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent": That model, ofqn, has crashed between zips 2 & 3. Judging by the trickle list, it was while getting the data ready to zip up (for zip 3), to return to the project. In other words, "the 25% problem". (In this case the 75% point.) The error list shows BOINC stopping a lot, indicative of the option "Suspend work if CPU usage is above" still being set to the default of 25%. That is fine for other projects, but not here. These programs DON'T like being interrupted at certain critical points. So it's possible that you started to use the computer at that moment, the CPU load went above 25%, and BOINC (and the model) stopped. In the case of the model, permanently. |
Send message Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0 |
"The error list shows BOINC stopping a lot, indicative of the option: Suspend work if CPU usage is above being still set to the default of 25%." Les, thanks for looking at things. Current settings under "Computing allowed": 1] while computer is in use; 2] while processor usage is less than 0 percent. I'll change "Only after computer has been idle for" to 0 minutes; it was on 3.00 mins. {Not entirely sure what this latter setting actually means, or really remember why it was on 3.00.} Also, applications are left in memory on suspend. It's true, there does tend to be a whole lot of stuff running on the PC most times. I do need to get myself a new desktop PC which will share the workload of everything going on! :) Pete |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 2,014,122 RAC: 399 |
If you have selected "while computer is in use", the setting doesn't matter. Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0 |
Thanks prof. Desty, I sensed some double speak in there... |
Send message Joined: 5 Aug 04 Posts: 127 Credit: 24,517,986 RAC: 17,587 |
"Computing allowed": You're not allowed to set "has been idle for" to zero minutes, even though, as has already been mentioned, this setting isn't used if you don't suspend computing for any reason. While it's possible to manually edit the preference file (either override or general) and set it to zero, if you do this the client default is used instead, and this is probably 3 minutes. Some other settings, on the other hand, do accept zero minutes, and, a little inconsistently, zero percent for processor usage means 100%. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,623,821 RAC: 1,846 |
Hi guys, it seems my fourth project attempt got stuck, this time at 75% - it is running but with no progress on the %. I have a Core Duo system with 3 GB RAM and have allocated one of the CPUs to be used by BOINC at 90% (so as not to overheat). I set BOINC to run while the computer is in use and to suspend when CPU usage is above 50%. I use my computer all the time for work and sometimes I need to shut it down 2-3 times per day. Usually I turn it on in the morning and shut it down at night. I have checked "leave application in memory while suspended". But while working, at some point the whole computer becomes slow. So far none of my 4 attempts to complete a task has been successful. I wonder whether there is any use in my computations at all if not a single task has been completed. If these models are that sensitive, isn't it possible for the client to request something that can be completed? I tried leaving the computer running longer at the 25% and 50% points to get past these thresholds, but at 75% I could not. It computes around 2% per day, so it is hard to avoid stopping at thresholds unless I leave the computer running for more than 24 h, which is rarely possible. Any suggestions to overcome this? Cheers |
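For anyone who would rather set the preferences discussed in this thread through a local override file than through the website or the Manager, here is a minimal Python sketch that writes out the three settings (compute while the computer is in use, no suspension on CPU load, leave tasks in memory). The tag names and the file name `global_prefs_override.xml` are given as I understand them from the BOINC client's local preferences and are assumptions here, so check them against your client version before relying on this. `RECOMMENDED` and `build_override` are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed BOINC override tag names; verify against your client's documentation.
RECOMMENDED = {
    "run_if_user_active": "1",    # compute while the computer is in use
    "suspend_cpu_usage": "0",     # 0 = do not suspend because of other CPU load
    "leave_apps_in_memory": "1",  # keep suspended tasks in memory
}


def build_override(settings):
    """Return override preferences as XML text, one tag per line."""
    lines = ["<global_preferences>"]
    for tag, value in settings.items():
        lines.append(f"   <{tag}>{value}</{tag}>")
    lines.append("</global_preferences>")
    text = "\n".join(lines)
    ET.fromstring(text)  # raises ParseError if the text is not well-formed XML
    return text


if __name__ == "__main__":
    # Print the file content; saving it as global_prefs_override.xml in the
    # BOINC data directory is left to the reader (assumed location).
    print(build_override(RECOMMENDED))
```

The sketch only prints the text, so nothing is changed on disk. The client would need to re-read its local preferences (or be restarted) before any such file takes effect.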
©2024 cpdn.org