Message boards : Number crunching : Where do all the errors come from?
Author | Message |
---|---|
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
This may be a stupid question. A model is running, with Network Activity Suspended. Backups are being taken regularly. It crashes, as they do occasionally. A Backup is restored and crunching recommences. Networking is only ever turned on, briefly, to let a trickle or decadal zip file upload and then turned off again. How then does the Server come to know about errors and list them on the model's Result page, given that the crash occurred, the Backup was restored and recovery was obtained all without communication with the network? Puzzled. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Are the reported errors the ones which led to the fatalities? Or the usual litany we see from recent boinc versions? There should be no way for 'knowledge' of an error/crash to carry over when a backup is restored. Copies of stderr/stdoutdae.txt and client_state.xml are returned to their pre-crash condition... Can you point us to a specific case? Edit: I'm assuming that you mean 'restore the entire boinc folder' when you say restore Backups. (Piecemeal attempts to 'restore', problematic at best, could leave old Trickles...) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I guess it may be this result: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6622010 How are you restoring from the backup: are you copying the backup over the top of the normal folder, or are you renaming folders? The reason I ask is that if you simply copy the backup folder over the top, files which exist in the original boinc folder but not in the backup folder are left intact. This includes result uploads which tell the servers that the model crashed. I prefer to rename folders so that they stay separate. Of course, it doesn't actually matter whether the server thinks the model crashed or not, because the trickle uploads are the important thing from the point of view of the scientists (the model's status is ignored). I'm a volunteer and my views are my own. News and Announcements and FAQ |
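To illustrate the point about copying 'over the top', here is a minimal Python sketch that lists files which exist in the live folder but not in the backup, and which a plain overwrite would therefore leave in place. The function name and the folder paths are hypothetical, chosen only for this example; it is not part of BOINC.

```python
from pathlib import Path


def files_left_behind(live_dir, backup_dir):
    """Return relative paths of files present in live_dir but absent from backup_dir.

    These are the files that survive when a backup is copied over the top
    of the live folder without deleting or renaming the live folder first.
    """
    live = Path(live_dir)
    backup = Path(backup_dir)
    # Collect file paths relative to each root so the two trees can be compared.
    live_files = {p.relative_to(live) for p in live.rglob("*") if p.is_file()}
    backup_files = {p.relative_to(backup) for p in backup.rglob("*") if p.is_file()}
    return sorted(live_files - backup_files)
```

Running something like `files_left_behind(r"c:\BOINC", r"d:\BOINC_backup")` before a restore would show what an overwrite leaves untouched; an empty list means the two folders hold the same set of file names.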
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
In my how-to-backup, there are 2 things that are important: 1. If copying 'over the top' of the original, ALWAYS DELETE THE ORIGINAL FIRST. 2. ALWAYS re-boot the computer afterwards; otherwise you'll still have all of the old info stored in RAM, and some may be faulty. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Yes, http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6622010 is the one that causes me to ask the question. How do I do it? 1. I close down BOINC (File > Exit). 2. I delete all files and folders from c:\BOINC. 3. I copy all files and folders from my backup copy and paste them into c:\BOINC. 4. I confess I have not been rebooting at this point; I simply restart BOINC. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I'd make the following modifications to your procedure...
Rename BOINC to BOINC_old
Copy and paste your backup directory to c:\, then rename it to BOINC
I'm a volunteer and my views are my own. News and Announcements and FAQ |
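The rename-based restore described in this post can be sketched in Python as follows. This is only an illustration under stated assumptions: BOINC has already been exited, the function name `restore_by_rename` and the `_old` suffix handling are hypothetical, and the paths are whatever you use on your own machine.

```python
import shutil
from pathlib import Path


def restore_by_rename(live_dir, backup_dir, old_suffix="_old"):
    """Park the live folder under a new name, then copy the backup into place.

    Nothing from the crashed folder can leak into the restored one, because
    the restored folder is created fresh from the backup.
    """
    live = Path(live_dir)
    backup = Path(backup_dir)
    parked = live.with_name(live.name + old_suffix)

    if not backup.is_dir():
        raise FileNotFoundError(f"backup folder not found: {backup}")
    if parked.exists():
        # Refuse to clobber an earlier parked copy; the user should remove it.
        raise FileExistsError(f"{parked} already exists; delete or rename it first")

    if live.exists():
        live.rename(parked)          # e.g. BOINC -> BOINC_old
    shutil.copytree(backup, live)    # the backup itself stays untouched
    return parked
```

Copying the backup rather than moving it keeps the backup available for another attempt; the parked `_old` folder can be deleted once the restored model is running again.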
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
A couple of years back, I too wasn't rebooting, and couldn't work out why a string of backups weren't working. (I keep them on a different partition, and just keep adding to them until space gets short before deleting VERY old backups.) Then I worked out that BOINC must be able to reuse data "kept in memory" when it is restarted. So I rebooted to flush the 'bad' data, and the backups all started working. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
That's very interesting and, though it seems to defy some logic, I'll follow that course. I assume "rebooting" means rebooting the whole machine, not just BOINC and its apps? |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Yup. I usually reboot each time I back up Boinc simply because my PC starts to go slow and then crashes if it is up too long (I think it's due to a memory leak in VSMON - my virus checker). I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 29 Oct 07 Posts: 4 Credit: 39,104 RAC: 0 |
I am confused. You all seem to know your BOINC and climate prediction very well... My question is a simple one: why does it say computation error? I did 75 hours on one 825 hour file and it simply said computation error. Is it my PC, or was the file that I received simply flawed? Also, is there any chance of overheating on my PC if I leave it on all the time with boinc running at 50% of CPU time in the background? I have a 2.8 Dell dimension 92000 with 4 GB of RAM. THANKS! |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Well, I had a lengthy reply composed, hit 'Post reply' -- and next saw the download page. Mumble, mumble. (Let's hear it for the boinc BB!) Without redoing the entire thing (as though I could): You have 3GB of memory. Not usual. Likely 2*1GB plus 2*512MB. Same manufacturer? Same timings? Vista: Where is your boinc folder? If in C:\Program Files, that's a problem. Please put it anywhere else. D:\boinc would be good. (I format my hard disks to give boinc its own Partition. This has advantages.) You lost two Models with similar errors. '22' is a catch-all error and tells us nothing useful. Did you install the latest graphics drivers? Is the box overclocked (not sure it can be done on a Dell)? Do you run heavy-resource progs like games or video editing with boinc/CPDN active? I forget what else I wrote... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Overheating. Hmmm. Dell - made for a price. Probably a minimum of case cooling, perhaps just the power-supply-unit fan. So minimum air flow through the case to cool the processor. Living in Canada, so room temps should be getting low. Which leaves the cpu heatsink: is it dust free? Dust acts as an insulator, so the processor heat can't escape. But lots of us leave our computers on 24/7, running full processor power. :) As for the models that failed: this can be because of something going wrong with the computer, or it can indeed be the dataset for the model; sometimes the combination of values used for the model can result in it becoming unstable, so the model will then crash. One such instance is the well-known "Negative pressure". This failure before the end target year is part of the experiment; the researchers want to know which combinations cause it, and there's only one way to find out. PS Bad luck Astro. I keep thinking that I should "Ctrl C" my posts first, but I always forget. Backups: Here |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
This bit was buried at the far end of the error log. While it's not very clear, it might be due to the CPU being used by something else for an extended period of time. CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3156, iMonCtr=1 This is the kind of error which backups can solve (see the 'backups and restores' readme via the link in my signature). If you are going to be running something which uses the CPU on the PC for an extended period of time (such as a game, video compression / ripping, etc), then I'd suggest exiting from Boinc first. Just right-click on the icon and select 'exit'. I'd also recommend scanning through the other readmes at the same time to see if there is anything of interest. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Thanks, MikeMars. Yes, I'm pretty disciplined about frequency of taking backups and am aware of the need to exit BOINC when doing mill-intensive stuff, but haven't a clue about what could have caused that specific error message. Ah well, some things are intended to remain a mystery, I suppose. |
Send message Joined: 29 Oct 07 Posts: 4 Credit: 39,104 RAC: 0 |
Hey, I had another computation error. This is what my screen says:
21/11/2007 1:49:36 AM||General prefs: using your defaults
21/11/2007 1:49:36 AM||Reading preferences override file
21/11/2007 1:49:36 AM||Preferences limit memory usage when active to 1534.57MB
21/11/2007 1:49:36 AM||Preferences limit memory usage when idle to 1534.57MB
21/11/2007 1:49:36 AM||Preferences limit disk usage to 27.77GB
21/11/2007 1:52:52 PM|climateprediction.net|Deferring communication for 1 min 0 sec
21/11/2007 1:52:52 PM|climateprediction.net|Reason: Unrecoverable error for result hadcm3iozn_cpnx_2000_80_125899030_2 (The device does not recognize the command. (0x16) - exit code 22 (0x16))
21/11/2007 1:52:54 PM|climateprediction.net|Computation for task hadcm3iozn_cpnx_2000_80_125899030_2 finished
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_1.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_2.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_3.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_4.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_5.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_6.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_7.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 1:52:54 PM|climateprediction.net|Output file hadcm3iozn_cpnx_2000_80_125899030_2_8.zip for task hadcm3iozn_cpnx_2000_80_125899030_2 absent
21/11/2007 6:39:36 PM||Resuming network activity
21/11/2007 6:39:36 PM|climateprediction.net|Sending scheduler request: Requested by user
21/11/2007 6:39:36 PM|climateprediction.net|Requesting 30240 seconds of new work, and reporting 1 completed tasks
21/11/2007 6:39:41 PM|climateprediction.net|Scheduler RPC succeeded [server version 509]
21/11/2007 6:39:44 PM|climateprediction.net|[file_xfer] Started download of file hadsm3fub_0332_005911804.zip
21/11/2007 6:39:45 PM|climateprediction.net|[file_xfer] Finished download of file hadsm3fub_0332_005911804.zip
21/11/2007 6:39:45 PM|climateprediction.net|[file_xfer] Throughput 16009 bytes/sec
21/11/2007 6:39:46 PM|climateprediction.net|Starting hadsm3fub_0332_005911804_6
21/11/2007 6:39:46 PM|climateprediction.net|Starting task hadsm3fub_0332_005911804_6 using hadsm3 version 506
21/11/2007 6:44:36 PM||Suspending network activity - user request
21/11/2007 10:01:31 PM||Suspending computation - user request
22/11/2007 1:00:17 AM||Resuming computation
I did what some of you guys said: I changed where my BOINC is located. It is now on my G drive, which has some 70 GB of space, of which climate prediction takes up to 1.3 GB. I normally leave it on all the time since I never shut down my PC. I set it to 50% CPU time usage to avoid overheating. This type of error never happened with SETI or Rosetta. In fact I did some 30000 hours with Rosetta straight, never shutting down my PC, and I never once had a problem with errors or heating. The stats of my PC are the following:
Manufacturer: Dell
Model: Dimension DXP061
Windows experience index rating: 5.5
Processor: Intel(R) Core(TM) 2 Quad CPU 2.40 GHZ 2.39 GHZ
Memory RAM: 3070 MB (physically I have 4 GB installed, but Vista can only see 3 GB max)
System type: 32-bit operating system
Windows edition: Windows Vista Home Premium
So my questions are: 1) What causes these computation errors? 2) What can I do to fix them (since I have already allocated a disk for it)? 3) Why does it never happen with others like Rosetta and SETI? Anyhow, I hope someone can help me... if this continues I will most likely drop climate prediction and continue with Rosetta and SETI alone. THANKS all! |
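For long message logs like the one in this post, a small script can pick out just the failures. The following Python sketch assumes the `timestamp|project|message` layout and the "Unrecoverable error for result ..." wording exactly as they appear in the excerpts quoted in this thread; the function name is hypothetical and other BOINC versions may format messages differently.

```python
import re

# Matches both wordings seen in this thread:
#   "... - exit code 22 (0x16))"  and  "(process exited with code 22 (0x16, -234))"
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(.*exit(?:ed with)? code (?P<code>-?\d+)"
)


def find_unrecoverable(lines):
    """Return (timestamp, project, result_name, exit_code) for each failed result."""
    hits = []
    for line in lines:
        parts = line.split("|", 2)   # timestamp | project | message
        if len(parts) != 3:
            continue                 # not a message-log line
        timestamp, project, message = parts
        match = ERROR_RE.search(message)
        if match:
            hits.append(
                (timestamp.strip(), project, match.group("result"), int(match.group("code")))
            )
    return hits
```

Feeding it the lines of a saved message log gives one tuple per failed result, which makes it easier to line up crash times with whatever else was running on the PC.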
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
The most recent one was this: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6965692 CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=4968, iMonCtr=1 There was a similar one a day before: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6965694 A different one (16th Nov) was this: exit code -1073741502 (0xc0000142) The latter two crashes seemed to happen in the late evening, UK time. Do you recall what was happening on the PC at those times, perhaps a game or something else which uses 100% of CPU time for an extended period? The 0xc0000142 happens when the PC is about to crash, and can't start any new processes. The very latest version of the Boinc Manager, 5.10.30 (in testing, not released) handles the C0000142 error better. 1) Possibly due to games or other stuff running at the same time, or out of date drivers for your motherboard graphics? There are other possible causes. 2) Try right-clicking and selecting 'exit' on the boinc icon before playing games, doing anything else which uses 100% of CPU time such as video encoding, or shutting down the system. Also try disabling the Boinc screensaver, and see if you can find an update for the graphics drivers on your PC (you'll need to look on Dell's website). 3) Firstly, if one in a hundred Rosetta jobs was failing, you'd probably not notice - because the climate project runs so much longer, a single failure is more obvious. However, the climate model is more sensitive than other Boinc tasks about applications which use 100% of the CPU time at normal priority. I'd recommend that you have a read through the 'READMEs' to see if there is anything which looks relevant (link in my signature). I'm a volunteer and my views are my own. News and Announcements and FAQ |
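The two forms of the exit code quoted in this post, -1073741502 and 0xc0000142, are the same 32-bit value read as a signed and as an unsigned integer. A short Python sketch of the conversion (the helper names are made up for this example):

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as unsigned (two's complement)."""
    return code & 0xFFFFFFFF


def to_signed32(code):
    """Reinterpret an unsigned 32-bit exit code as signed."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code >= 0x80000000 else code


# The signed value from the result page maps onto the hex form in brackets.
print(hex(to_unsigned32(-1073741502)))   # 0xc0000142
print(to_signed32(0xC0000142))           # -1073741502
print(hex(22))                           # 0x16, the catch-all 'exit code 22'
```

This is handy when searching for an error, since some pages quote the decimal form and others the hex form.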
Send message Joined: 29 Oct 07 Posts: 4 Credit: 39,104 RAC: 0 |
Shit, I remember what it was. My brother plays Counter-Strike in those time frames. That's why the CPU may be generating errors for BOINC. I will tell him to exit boinc when he plays his game. Thanks a lot. I think now it should all be OK. Sorry for the numerous long posts and my slowness at understanding! |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Glad that you've managed to solve the mystery :-) I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 31 Aug 04 Posts: 42 Credit: 15,308,708 RAC: 298 |
Well y'all didn't make me feel all warm and fuzzy about solving my errors... let's take a run at it. Today (after a few days of winding down WUs on the machine) I formatted the HDD and installed Xubuntu v7.10 (64bit) on this machine. It's been running WinXP for some time with multiple projects on it. It's an AMD X2 4200+ with 2 x 256MB of PC4000 RAM. It's a dedicated number cruncher and is normally "headless". I installed the v5 stdc++ libs (Gutsy comes with v6) required by several project apps (QMC, E&H, Leiden, WCG and perhaps CPDN). I used the package install to get the AMD64 version of BOINC v5.10.8 up and running as a daemon. I encountered these errors:
Sat 01 Dec 2007 06:24:50 PM CST|QMC@HOME|Reason: Unrecoverable error for result three_ad_anthracene.3996_0 (process exited with code 22 (0x16, -234))
Sat 01 Dec 2007 06:25:47 PM CST|climateprediction.net|Reason: Unrecoverable error for result hadsm3fub_0107_005913005_1 (process exited with code 22 (0x16, -234))
Sat 01 Dec 2007 06:25:51 PM CST|Einstein@Home|Reason: Unrecoverable error for result h1_0666.20_S5R2__265_S5R3a_1 (process exited with code 22 (0x16, -234))
Sat 01 Dec 2007 06:25:53 PM CST|World Community Grid|Reason: Unrecoverable error for result dddt0201k0629_ZINC06913243-0000_00_0 (process exited with code 22 (0x16, -234))
One thing that makes me think it's app dependent is that one of WCG's other apps runs fine. Any thoughts? PS: I see "execv: No such file or directory" in this returned result http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7013096 - da shu @ HeliOS, "Free software is a matter of liberty, not price. To understand the concept, you should think of free as in free speech, not as in free beer" |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
©2024 cpdn.org