Thread 'Unrecoverable error after 4100 hours :('

old_user35834
Joined: 12 Jan 05
Posts: 12
Credit: 40,824
RAC: 0
Message 27289 - Posted: 12 Mar 2007, 14:35:11 UTC
Last modified: 12 Mar 2007, 14:49:24 UTC

My work unit errored out after more than 4100 hours, at almost 80% complete. That's really too bad. What might have caused this? The machine was dedicated only to this project. No task switching, no new version installed, no Windows update. Just crunching happily and then the error :(

See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5276332

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

old_user35834
Joined: 12 Jan 05
Posts: 12
Credit: 40,824
RAC: 0
Message 27290 - Posted: 12 Mar 2007, 14:44:50 UTC
Last modified: 12 Mar 2007, 14:50:34 UTC

What are the chances of success if I restore a backup of that directory? It had trickled about an hour before the error, and the backup is at least a day old. So the same trickle would be sent to the server more than once. Any idea what would happen?

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

Strathpeffer
Joined: 9 Jan 07
Posts: 497
Credit: 342,899
RAC: 0
Message 27291 - Posted: 12 Mar 2007, 14:49:52 UTC
Last modified: 12 Mar 2007, 14:51:04 UTC

The server would just ignore the repeated trickle and would start acknowledging trickles again from the first new one. (Says one who has restored from backups often!) However, if it crashes again in the same place, you've probably got a looper :-( and should abandon it.
Visit the Scotland team
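
The behaviour described in this post (repeated trickles ignored, acknowledgement resuming at the first new one) can be pictured with a small sketch. This is not the CPDN server code: the class name `TrickleLog`, the `receive` method, and the idea of identifying a trickle by a (result id, model timestep) pair are all assumptions made for illustration.

```python
class TrickleLog:
    """Toy model of a server that ignores trickles it has already seen."""

    def __init__(self) -> None:
        # Hypothetical key: (result id, model timestep) of each trickle received.
        self._seen: set[tuple[int, int]] = set()

    def receive(self, result_id: int, timestep: int) -> bool:
        """Return True if the trickle is new, False if it is a repeat."""
        key = (result_id, timestep)
        if key in self._seen:
            return False  # duplicate after a restore: ignored
        self._seen.add(key)
        return True  # first time seen: acknowledged


log = TrickleLog()
for step in (1, 2, 3):  # trickles sent before the crash
    log.receive(5276332, step)

# After restoring an older backup the model re-sends steps 2 and 3, then step 4.
results = [log.receive(5276332, step) for step in (2, 3, 4)]
```

In this sketch the two re-sent trickles are dropped and the first genuinely new one is accepted, which matches the behaviour reported here for restored models.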

old_user35834
Joined: 12 Jan 05
Posts: 12
Credit: 40,824
RAC: 0
Message 27292 - Posted: 12 Mar 2007, 15:00:17 UTC - in response to Message 27291.  

The server would just ignore the repeated trickle and would start acknowledging trickles again from the first new one. (Says one who has restored from backups often!) However, if it crashes again in the same place, you've probably got a looper :-( and should abandon it.


Thanks for the reply. But it turns out this was a hypothetical question. I checked the backup. As this machine is no longer in production, it is also no longer included in our backup cycle. Pity.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27294 - Posted: 12 Mar 2007, 16:37:54 UTC

This error could have been caused by shutting down the computer without exiting from boinc first, or perhaps if you had a freeze-up and had to shut boinc down using Task Manager. It can sometimes mean your graphics card needs an update.

To avoid future crashes I'd have a look at the READMEs in my sig:

*In the README about Running the model, the top tips

*In the one about crashes, items #5 by Mike and #6 by Thyme Lawn

*Same crashes README - item #1 by Les gives an easy manual backup method that you could use in future. There are extra methods in the README specifically about backups.

One can't rely on automatic backups of one's whole hard drive or of a roomful of computers at one's place of work if they're carried out while the model's running. Backups made while the model's running can't be successfully restored.

Congratulations on completing so much of the model. What you've crunched will be used by the researchers.
Cpdn news

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 27514 - Posted: 26 Mar 2007, 1:44:38 UTC

Crash at 97.75% / 278 days :´(

Tutta, you're not alone ;-)


I still have one coupled model going, at 97.27% now - no idea how many days, BOINC lost track when it moved from a dual Xeon to a dual P3s.

If that one works, it will be my 1st HadCM3 full run - had a few HadSM3 and Sulphurs full runs but a coupled one is not in my collection yet.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27518 - Posted: 26 Mar 2007, 7:33:48 UTC


I've got my fingers crossed for your model :-)

97% does count as 'completed' from the scientists' viewpoint, since virtually all of the key data has already been uploaded by that point. So congratulations for that one too, although I understand completely why you'd prefer to be at 100%!

I'm a volunteer and my views are my own.
News and Announcements and FAQ

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 27529 - Posted: 26 Mar 2007, 18:06:03 UTC
Last modified: 26 Mar 2007, 18:08:07 UTC

By the way, the server ignores duplicate trickles. They don't cause trouble even if they come from a different machine than the one the model was delivered to.

Thanks for the head up :-) A full run at CPDN is something special, yes - and at least once I want to see the 100%

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 27693 - Posted: 2 Apr 2007, 17:22:27 UTC

It's done, that 50% Xeon / 50% P3s model, it's a full run!!!

That's better than ;-)

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27694 - Posted: 2 Apr 2007, 17:25:13 UTC


Congratulations :-)

(And you won't even be busted by the cops)
I'm a volunteer and my views are my own.
News and Announcements and FAQ

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27695 - Posted: 2 Apr 2007, 17:31:19 UTC
Last modified: 2 Apr 2007, 17:31:56 UTC

Well done! And possibly the straightest graph I've ever seen.


Cpdn news

mray
Joined: 30 Apr 06
Posts: 8
Credit: 10,884,632
RAC: 1,681
Message 27883 - Posted: 14 Apr 2007, 1:54:15 UTC

I just had one end with a computation error when it was in the mid-90s percent. My last backup was at 83%. It was reporting some kind of file I/O errors. I had to really struggle to get the last one to complete, with many restores. This time it's too far back and I abandoned it.

My last WU was on a network drive, but I had this one on a local drive. I have no idea what caused the errors this time around. This was the first error it had; it made it all the way to the mid-90s with no problem, then BANG.

I'm abandoning the project for now, maybe forever. They need to work on this thing more to break the WUs up more; a single run takes far too long right now and there are too many possibilities for errors. Backing it up myself is not a viable solution. BOINC is not supposed to require user intervention.




MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27885 - Posted: 14 Apr 2007, 8:14:27 UTC


Given that the model uploads its climate as it goes (summary at each year, more detailed summary at the model decades, and a full restart dump at 1960, 2000, and 2040), it doesn't really matter a huge amount if it doesn't reach its end.

Not only that, but your model reached 2071 - which is as good as done as far as the scientists are concerned. It will already have been added to the count of completed models on the front page (successfully completed = 2050 as far as the first page is concerned).

And finally - the reason it was stopped was that it was aborted 10 minutes before you posted.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6206053
<core_client_version>5.4.11</core_client_version>
<message>
aborted by user
</message>
<stderr_txt>

I'm a volunteer and my views are my own.
News and Announcements and FAQ

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 27888 - Posted: 14 Apr 2007, 9:09:45 UTC

And if (or when) you do come back, remember to ask about the automatic, set-and-forget, backup system.


KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 27894 - Posted: 14 Apr 2007, 12:53:53 UTC


:-) Les! Not fair :-) If there is such a method for backup :-( at least let us know ;-) I at least am not leaving, but on such long WUs :-) I do tend to get nightmares ;-) LoL
Regards
Masud.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27895 - Posted: 14 Apr 2007, 13:13:30 UTC
Last modified: 14 Apr 2007, 13:16:04 UTC

There's a backup-and-restore page here, which links to loads of different ways of doing backups. I tend to stick to the 'old school' way of shutting down boinc, copying the boinc folder, and restarting it, at weekly intervals, but a lot of people use more sophisticated methods!

The thread is here:
http://www.climateprediction.net/board/viewtopic.php?t=5895

The automatic backup programme is here:
http://bbc.cpdn.org/forum_thread.php?id=2748

It was written by RRodway, a participant originally on the BBC/CCE variant of the project (which is why you'll see 'BBC' all over it).
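
For anyone who wants to script the 'old school' routine mentioned in this post (shut down boinc, copy the boinc folder, restart), here is a minimal sketch. It is not RRodway's programme and not an official tool. The function name `backup_boinc_folder` is made up, and `stop_client` / `start_client` are hypothetical hooks: pass in whatever actually stops and restarts BOINC on your own machine.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def backup_boinc_folder(
    data_dir: Path,
    backup_root: Path,
    stop_client: Callable[[], None],
    start_client: Callable[[], None],
) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    stop_client and start_client are assumptions supplied by the caller;
    this sketch does not know how BOINC is managed on your system.
    """
    dest = backup_root / time.strftime("boinc-%Y%m%d-%H%M%S")
    stop_client()  # the model must not be running while its files are copied
    try:
        shutil.copytree(data_dir, dest)  # creates dest; fails if it already exists
    finally:
        start_client()  # restart crunching even if the copy failed
    return dest
```

Stopping first is the whole point: as noted earlier in the thread, backups made while the model is running can't be successfully restored. Each run writes to a fresh folder, so older backups are kept until you delete them yourself.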

Sometimes a backup just won't help - for example, if a model crashes with a 'NEGATIVE PRESSURE DETECTED' error, it'll almost always crash at the same spot again if you restore it. But in this case, it means that the model has reached as far as it'll ever go.
I'm a volunteer and my views are my own.
News and Announcements and FAQ

mray
Joined: 30 Apr 06
Posts: 8
Credit: 10,884,632
RAC: 1,681
Message 27907 - Posted: 14 Apr 2007, 20:38:10 UTC - in response to Message 27885.  


And finally - the reason it was stopped was that it was aborted 10 minutes before you posted.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6206053
<core_client_version>5.4.11</core_client_version>
<message>
aborted by user
</message>
<stderr_txt>

I restored from backup and then aborted that. Probably not the best idea, but I had already deleted the one that had the computation failure. I don't think it made much difference, though it may affect stats concerning computation errors vs user aborted.



old_user201021
Joined: 30 Sep 06
Posts: 18
Credit: 93,623
RAC: 0
Message 28390 - Posted: 3 May 2007, 4:08:33 UTC

Maybe it was corruption caused by Cosmic Rays. What is your elevation there?

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 28542 - Posted: 9 May 2007, 0:10:09 UTC
Last modified: 9 May 2007, 0:11:49 UTC

A team mate just posted (in the team forum), that his model
crashed after shutting down / restarting his computer.

When I checked the result, I found "aborted by user",
"CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Abort request from BOINC..."

CC is 5.8.8

Is there some known issue with CPDN and shutdowns or is it
cosmic rays again?

resultid=6513196

He might not have stopped BOINC before he shut down but
it would be good if a model survived that.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 28543 - Posted: 9 May 2007, 0:55:36 UTC
Last modified: 9 May 2007, 1:01:13 UTC

Hi Ananas

Yes, there's a big issue with shutting down the computer. Your friend needs to exit from boinc first by right-clicking on the B icon and selecting Exit. Then wait for the icon to disappear, then begin the shutdown process. Sometimes a model survives if boinc isn't exited, sometimes it doesn't.
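
The order of steps here (exit boinc, wait until it has really gone, only then shut down) could be scripted along these lines. This is only a sketch: `wait_until_stopped` is a made-up helper, and `is_running` is a hypothetical check you would have to supply yourself, for example one that looks for the BOINC process on your system.

```python
import time
from typing import Callable


def wait_until_stopped(
    is_running: Callable[[], bool],
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
) -> bool:
    """Return True once is_running() reports False, or False on timeout."""
    deadline = time.monotonic() + timeout_s
    while is_running():
        if time.monotonic() >= deadline:
            return False  # still running: do not shut the computer down yet
        time.sleep(poll_s)
    return True  # the client has exited; shutting down should now be safe
```

A shutdown script would only go ahead when this returns True, which is the scripted equivalent of waiting for the icon to disappear.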

However, the model's results page doesn't show that. It shows 'aborted by user':

http://bbc.cpdn.org/show_user.php?userid=156404

All your friend's models are quickly crashing. E.g. I think he must have run an AV scan without exiting from boinc. The scan locked a file and crashed these:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6497343
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6492724
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6465859

and maybe this one though it crashed with a 107 code which is different from the others

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6458633

and he aborted his first model:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6447344

His computer looks good - dual-core, lots of memory. But I don't think he has any idea how to keep his models safe. Please explain to him that he mustn't abort his models or he'll never finish one.

Would it be possible for you to translate the following items into German for your team members and put them in a sticky on your forum? Go to the project READMEs through my signature.

*Running the model README - the top tips

*README about avoiding crashes - you'd need to make a summary of item #5 by Mike

*Crashes README again - item #1 by Les about how to back models up.

That's the essential info that cpdn crunchers need. If you did translate that, it would be a good idea if you could post the German translation here so that Saenger, Tomcat and the many other German team members can copy it to their forums for people whose English isn't so good. I'm sure they would do this. There are several German forums with cpdn crunchers.

If you can do that, it would be better to post a translation & discuss it in a separate thread which could all be in German. With a thread title in German to attract the people who need to read it.


Cpdn news