Thread 'Unrecoverable error after 4100 hours :('

Author	Message
old_user35834 Joined: 12 Jan 05 Posts: 12 Credit: 40,824 RAC: 0	Message 27289 - Posted: 12 Mar 2007, 14:35:11 UTC Last modified: 12 Mar 2007, 14:49:24 UTC My work unit errored out after over 4100 hours and almost 80% complete. That's really too bad. What might have caused this? The machine was dedicated only to this project. No task switching, no new version installed, no Windows update. Just crunching happily and then the error :( See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5276332 BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning Tutta55's Lair

old_user35834 Joined: 12 Jan 05 Posts: 12 Credit: 40,824 RAC: 0	Message 27290 - Posted: 12 Mar 2007, 14:44:50 UTC Last modified: 12 Mar 2007, 14:50:34 UTC What are the chances of success if I restore a backup of that directory? The model trickled about an hour before the error, and the backup is at least a day old, so the same trickle would be sent to the server more than once. Any idea what would happen?

Strathpeffer Joined: 9 Jan 07 Posts: 497 Credit: 342,899 RAC: 0	Message 27291 - Posted: 12 Mar 2007, 14:49:52 UTC Last modified: 12 Mar 2007, 14:51:04 UTC The server would just ignore the repeated trickle and would start acknowledging trickles again from the first new one. (Says one who has restored from backups often!) However, if it crashes again in the same place, you've probably got a looper :-( and should abandon it. Visit the Scotland team

old_user35834 Joined: 12 Jan 05 Posts: 12 Credit: 40,824 RAC: 0	Message 27292 - Posted: 12 Mar 2007, 15:00:17 UTC - in response to Message 27291. "The server would just ignore the repeated trickle and would start acknowledging trickles again from the first new one. (Says one who has restored from backups often!) However, if it crashes again in the same place, you've probably got a looper :-( and should abandon it." Thanks for the reply. But it turns out this was a hypothetical question. I checked the backup: as this machine is no longer in production, it is also no longer included in our backup cycle. Pity.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 27294 - Posted: 12 Mar 2007, 16:37:54 UTC This error could have been caused by shutting down the computer without exiting from boinc first, or perhaps by a freeze-up after which you had to shut boinc down using Task Manager. It can sometimes mean your graphics card driver needs an update. To avoid future crashes I'd have a look at the READMEs in my sig: in the README about Running the model, the top tips; in the one about crashes, items #5 by Mike and #6 by Thyme Lawn. In the same crashes README, item #1 by Les gives an easy manual backup method that you could use in future. There are extra methods in the README specifically about backups. One can't rely on automatic backups of one's whole hard drive, or of a roomful of computers at one's place of work, if they're carried out while the model's running. Backups made while the model's running can't be successfully restored. Congratulations on completing so much of the model. What you've crunched will be used by the researchers. Cpdn news*

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 27514 - Posted: 26 Mar 2007, 1:44:38 UTC Crash at 97.75% / 278 days :´( Tutta, you're not alone ;-) I still have one coupled model going, at 97,27% now - no idea how many days, BOINC lost track when it moved from a dual Xeon to a dual P3s. If that one works, it will be my 1st HadCM3 full run - had a few HadSM3 and Sulphurs full runs but a coupled one is not in my collection yet.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 27518 - Posted: 26 Mar 2007, 7:33:48 UTC I've got my fingers crossed for your model :-) 97% does count as 'completed' from the scientists' viewpoint, since virtually all of the key data has already been uploaded by that point. So congratulations for that one too, although I understand completely why you'd prefer to be at 100%! I'm a volunteer and my views are my own. News and Announcements and FAQ

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 27529 - Posted: 26 Mar 2007, 18:06:03 UTC Last modified: 26 Mar 2007, 18:08:07 UTC By the way, it ignores duplicate trickles; they do not cause trouble even if they come from a different machine than the one the model was delivered to. Thanks for the heads-up :-) A full run at CPDN is something special, yes - and at least once I want to see the 100%

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 27693 - Posted: 2 Apr 2007, 17:22:27 UTC It's done, that 50% Xeon / 50% P3s model, it's a full run !!! That's better than ;-)

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 27694 - Posted: 2 Apr 2007, 17:25:13 UTC Congratulations :-) (And you won't even be busted by the cops)

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 27695 - Posted: 2 Apr 2007, 17:31:19 UTC Last modified: 2 Apr 2007, 17:31:56 UTC Well done! And possibly the straightest graph I've ever seen.

mray Joined: 30 Apr 06 Posts: 8 Credit: 10,884,632 RAC: 1,681	Message 27883 - Posted: 14 Apr 2007, 1:54:15 UTC I just had one end with a computation error when it was in the mid-90s percent. My last backup was at 83%. It was reporting some kind of file I/O errors. I had to really struggle to get the last one to complete, with many restores. This time it's too far back and I abandoned it. My last WU was on a network drive, but I had this one on a local drive. I have no idea what caused the errors this time around. This was the first error it had; it made it all the way to the mid-90s with no problem, then BANG. I'm abandoning the project for now, maybe forever. They need to work on this thing to break the WUs up more: a single run takes far too long right now and there are too many possibilities for errors. Backing it up myself is not a viable solution; BOINC is not supposed to require user intervention.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 27885 - Posted: 14 Apr 2007, 8:14:27 UTC Given that the model uploads its climate as it goes (a summary each year, a more detailed summary at the model decades, and a full restart dump at 1960, 2000, and 2040), it doesn't really matter a huge amount if it doesn't reach its end. Not only that, but your model reached 2071 - which is as good as done as far as the scientists are concerned. It will already have been added to the count of completed models on the front page (successfully completed = 2050 as far as the front page is concerned). And finally - the reason it stopped was that it was aborted 10 minutes before you posted. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6206053 <core_client_version>5.4.11</core_client_version> <message> aborted by user </message> <stderr_txt>

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 27888 - Posted: 14 Apr 2007, 9:09:45 UTC And if (or when) you do come back, remember to ask about the automatic, set-and-forget, backup system.

KAMasud Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 27894 - Posted: 14 Apr 2007, 12:53:53 UTC :-) Les! Not fair :-) If there is such a method for backup :-( at least let us know ;-) I at least am not leaving, but on such long WUs :-) I do tend to get nightmares ;-) LoL Regards Masud.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 27895 - Posted: 14 Apr 2007, 13:13:30 UTC Last modified: 14 Apr 2007, 13:16:04 UTC There's a backup-and-restore page here, which links to loads of different ways of doing backups. I tend to stick to the 'old school' way of shutting down boinc, copying the boinc folder, and restarting it, at weekly intervals, but a lot of people use more sophisticated methods! The thread is here: http://www.climateprediction.net/board/viewtopic.php?t=5895 The automatic backup programme is here: http://bbc.cpdn.org/forum_thread.php?id=2748 It was written by RRodway, a participant originally on the BBC/CCE variant of the project (which is why you'll see 'BBC' all over it). Sometimes a backup just won't help - for example, if a model crashes with a 'NEGATIVE PRESSURE DETECTED' error, it'll almost always crash at the same spot again if you restore it. But in this case, it means that the model has reached as far as it'll ever go.
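
For anyone who wants to script the 'old school' routine described above (exit boinc, copy the boinc folder, restart), here is a minimal Python sketch of the copy step only. It is an illustration, not the RRodway programme or anything shipped with BOINC: the function name `backup_boinc_folder`, the `boinc-backup-` folder prefix and the paths in the usage note are all assumptions, and it assumes boinc has already been exited by hand, since backups made while the model is running can't be restored.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_folder(boinc_dir, backup_root, stamp=None):
    """Copy the whole BOINC folder into a new timestamped backup folder.

    Assumption: BOINC has already been exited before this is called.
    The function does not stop or restart BOINC itself.
    """
    src = Path(boinc_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"BOINC folder not found: {src}")

    # One folder per backup, named by date and time unless a stamp is given.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"

    # copytree raises an error if dest already exists, so an older
    # backup is never silently overwritten.
    shutil.copytree(src, dest)
    return dest
```

Usage would look something like `backup_boinc_folder(r"C:\Program Files\BOINC", r"D:\backups")`, with both paths being placeholders for wherever your own boinc folder and backup drive live. Restoring is the reverse: exit boinc, copy the saved folder back over the boinc folder, and restart.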

mray Joined: 30 Apr 06 Posts: 8 Credit: 10,884,632 RAC: 1,681	Message 27907 - Posted: 14 Apr 2007, 20:38:10 UTC - in response to Message 27885. "And finally - the reason it was stopped was that it was aborted 10 minutes before you posted. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6206053 <core_client_version>5.4.11</core_client_version> <message> aborted by user </message> <stderr_txt>" I restored from backup and then aborted that. Probably not the best idea, but I had already deleted the one that had the computation failure. I don't think it made much difference, though it may affect the stats on computation errors vs user aborts.

old_user201021 Joined: 30 Sep 06 Posts: 18 Credit: 93,623 RAC: 0	Message 28390 - Posted: 3 May 2007, 4:08:33 UTC Maybe it was corruption caused by Cosmic Rays. What is your elevation there?

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 28542 - Posted: 9 May 2007, 0:10:09 UTC Last modified: 9 May 2007, 0:11:49 UTC A team mate just posted (in the team forum) that his model crashed after shutting down / restarting his computer. When I checked the result, I found "aborted by user,, CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Abort request from BOINC..." CC is 5.8.8 Is there some known issue with CPDN and shutdowns or is it cosmic rays again? resultid=6513196 He might not have stopped BOINC before he shut down but it would be good if a model survived that.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28543 - Posted: 9 May 2007, 0:55:36 UTC Last modified: 9 May 2007, 1:01:13 UTC Hi Ananas Yes, there's a big issue with shutting down the computer. Your friend needs to exit from boinc first by right-clicking on the B icon and selecting Exit, then wait for the icon to disappear, and only then begin the shutdown process. Sometimes a model survives if boinc isn't exited, sometimes it doesn't. However, the model's results page doesn't show that. It shows 'aborted by user': http://bbc.cpdn.org/show_user.php?userid=156404 All your friend's models are quickly crashing. E.g. I think he must have run an AV scan without exiting from boinc. The scan locked a file and crashed these: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6497343 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6492724 http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6465859 and maybe this one, though it crashed with a 107 code which is different from the others: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6458633 and he aborted his first model: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6447344 His computer looks good - dual-core, lots of memory. But I don't think he has any idea how to keep his models safe. Please explain to him that he mustn't abort his models or he'll never finish one. Would it be possible for you to translate the following items into German for your team members and put them in a sticky on your forum? Go to the project READMEs through my signature: the Running the model README - the top tips; the README about avoiding crashes - you'd need to make a summary of item #5 by Mike; the crashes README again - item #1 by Les about how to back models up. That's the essential info that cpdn crunchers need.
If you did translate that, it would be a good idea to post the German translation here so that Saenger, Tomcat and the many other German team members can copy it to their forums for people whose English isn't so good. I'm sure they would do this. There are several German forums with cpdn crunchers. If you can do that, it would be better to post the translation and discuss it in a separate thread, which could all be in German, with a thread title in German to attract the people who need to read it.