climateprediction.net (CPDN) home page
Thread 'Model crashed just at the end of phase 2'

Message boards : Number crunching : Model crashed just at the end of phase 2

old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12622 - Posted: 16 May 2005, 16:42:22 UTC

My model crashed just at the end of phase 2 (message in the messages-tab: 16-5-2005 16:31:26|climateprediction.net|Unrecoverable error for result 1d7y_000084449_0 ( - exit code -1073741819 (0xc0000005))). I was watching the model via "show graphics" when the model was writing its data to the disk. I closed the window; after a while I opened it again and the earth was completely blank (no colours). Shortly after that the model crashed. I (stupid me!) didn't copy the Windows message which appeared after that.

Can the crash be related to me watching the model? Can I do anything to provide you with more data? The model was using version hadsm3 4.10.

I was very disappointed that this happened after about 520 hours of work (since 23 March)...

Bye for now,

Frederique
ID: 12622
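Editor's note on the exit code quoted above: the log shows the same value twice, once as a signed 32-bit number (-1073741819) and once in hexadecimal (0xc0000005). 0xc0000005 is the Windows status code for an access violation, meaning the program touched memory it was not allowed to; the code alone does not say why. A minimal Python sketch of the conversion follows. The helper names and the one-entry lookup table are illustrative only and are not part of BOINC or CPDN.

```python
# Illustrative subset of Windows NTSTATUS values; extend as needed.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code the way the message log shows it, plus a name if known."""
    unsigned = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"{exit_code} = 0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))
```

The same masking explains other negative exit codes in a message log: the hexadecimal form is just the two's-complement view of the signed number.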
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12623 - Posted: 16 May 2005, 16:57:53 UTC

Another question: is it possible to restart the model using the same WU?

Greetings,

Frederique
ID: 12623
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 12624 - Posted: 16 May 2005, 17:09:57 UTC - in response to Message 12622.  

> My model crashed just at the end of phase 2 (message in the messages-tab:
> 16-5-2005 16:31:26|climateprediction.net|Unrecoverable error for result
> 1d7y_000084449_0 ( - exit code -1073741819 (0xc0000005))). I was watching the
> model via "show graphics" when the model was writing its data to the disk. I
> closed the window, after a while I opened it again and the earth was
> completely blank (no colours). Shortly after that the model crashed. I (stupid
> me!) didn't copy the Windows-message which appeared after that.

Welcome to the -1073741819 club. More and more people are getting these at the end of a phase. Did you have internet access connected at the time of failure?

>
> Can the crash be related to me, watching the model? Can I do anything to
> provide you with more data? The model was using version hadsm3 4.10.
>

Possibly a graphics problem, but that error is certainly not absolutely correlated with it. My errors (same error number) have been at the end of a phase with no graphics running.

> I was very disappointed that this happened after about 520 hours of work (since
> 23 March)...
>

Understandable disappointment. One of mine crashed at the end of phase 2 and the other at the end of phase 3. Both times internet connection to the server was questionable. It shouldn't matter, but it may. Other suggestions that may or may not help: make sure you have all the latest drivers for your motherboard/graphics/network card/modem. Some references to that error number on the net suggest driver problems, but I think that is a shot in the dark here.

As for saving the crashed model, the only way that could happen is if you had a recent backup of the boinc directory prior to the crash. In that case you could copy it back over the working directory and it would continue on. Of course if you are running other boinc projects as well, this would introduce other complications.
ID: 12624
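Editor's note on the backup suggestion above: the idea is simply to keep a dated copy of the whole BOINC directory so it can be copied back after a crash. A minimal Python sketch follows, under some assumptions: the function name and both paths in the usage comment are hypothetical, and stopping the BOINC client before copying is an assumption made here so the files are not changing mid-copy, not something the thread states.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC directory into a new timestamped folder.

    Assumption: the BOINC client has been stopped first, so no file
    is being written while the copy runs.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{boinc_dir.name}-{stamp}"
    # copytree creates the target (and missing parents) and fails if it exists.
    shutil.copytree(boinc_dir, target)
    return target


# Hypothetical usage; adjust both paths to your own installation:
# backup_boinc_dir(Path(r"C:\Program Files\BOINC"), Path(r"D:\boinc-backups"))
```

Restoring is the reverse copy over the working directory. As noted in the post, a restore also rolls back any other BOINC projects that live in the same directory.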
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12625 - Posted: 16 May 2005, 17:16:53 UTC

Thank you for your answer. I had internet access (it's an ADSL line), and I have no reason to believe that there was no internet access at that time, as I could download the next WU without any problems...

Did you get these problems also with version 4.12 of hadsm3? Which version of the BOINC client did you use? (I use 4.25). What a pity, at the end of phase 3; it must have been very frustrating for you...

Bye for now,

Frederique
ID: 12625
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 12626 - Posted: 16 May 2005, 17:23:45 UTC
Last modified: 16 May 2005, 17:29:32 UTC

Hi,
A few weeks ago, I had a crash when watching the globe while post-processing in phase 1.
I had a backup (always do a backup before end of phase and disable network access) and crunched again the end of the phase without watching the globe: it worked fine and this WU is now in phase 3.
I suspect an instability of the viz while the post-processing is under way: IMO, opening and closing the viz seems to crash the model.
Arnaud
ID: 12626
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12628 - Posted: 16 May 2005, 17:36:37 UTC

If it is any help: in the Event Viewer I saw the following event:
======================================================================
Event Type: Error
Event Source: Application Error
Event Category: None
Event ID: 1000
Date: 16-5-2005
Time: 16:31:09
User: N/A
Computer: NORROD
Description:
Faulting application hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, faulting module hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, fault address 0x00040e47.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
======================================================================
Arnaud, thank you for your tip: I'll try this (I have a backup from just 4 trickles ago)!
ID: 12628
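Editor's note on the event quoted above: the Application Error description packs the useful details (faulting application, faulting module, fault address) into one sentence. When collecting several such reports, a small parser makes them easier to compare. The Python sketch below is written against the exact wording of the event in this post; the pattern and function name are illustrative, and other Windows versions may word the description differently.

```python
import re

# Pattern written against the event text quoted in this thread.
EVENT_PATTERN = re.compile(
    r"Faulting application (?P<app>[^,\s]+), version (?P<app_version>[\d.]+), "
    r"faulting module (?P<module>[^,\s]+), version (?P<module_version>[\d.]+), "
    r"fault address (?P<address>0x[0-9a-fA-F]+)"
)


def parse_application_error(description: str) -> dict:
    """Extract the application, module and fault address from an event description."""
    match = EVENT_PATTERN.search(description)
    if match is None:
        raise ValueError("description does not look like an Application Error event")
    fields = match.groupdict()
    # Keep the address as an integer so reports can be compared numerically.
    fields["address"] = int(fields["address"], 16)
    return fields


event = (
    "Faulting application hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, "
    "faulting module hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, "
    "fault address 0x00040e47."
)
info = parse_application_error(event)
print(info["module"], hex(info["address"]))
```

Here the faulting module is the model executable itself rather than a separate DLL, which suggests (but does not prove) that the fault occurred inside the model's own code.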
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 12629 - Posted: 16 May 2005, 20:50:02 UTC - in response to Message 12625.  

> Did you get these problems also with version 4.12 of hadsm3? Which version of
> the BOINC client did you use? (I use 4.25). What a pity, at the end of phase
> 3; it must have been very frustrating for you...
>
This was with BOINC 4.25 and hadsm3 4.12. Yes, very frustrating. I wish I knew what was causing it.
ID: 12629
old_user67989

Joined: 30 Mar 05
Posts: 2
Credit: 366,631
RAC: 0
Message 12631 - Posted: 16 May 2005, 21:49:09 UTC

I had the same thing happen this evening.

climateprediction.net - 2005-05-16 18:26:22 - Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
climateprediction.net - 2005-05-16 18:26:55 - Unrecoverable error for result 1yvn_300112795_0 ( - exit code -1073741819 (0xc0000005))
climateprediction.net - 2005-05-16 18:26:55 - Deferring communication with project for 1 minutes and 0 seconds
climateprediction.net - 2005-05-16 18:26:55 - Computation for result 1yvn_300112795 finished

I had 2 WUs running on a P4 HT as usual. One had recently completed phase 2 but I had network access disabled and dial-up offline so it had gone into phase 3 without uploading. I enabled network, it went through the dial-up sequence and the trickles etc went up. Then the WU bombed but left the other WU, already in phase 3, running. Boinc downloaded and started another WU immediately.
ID: 12631
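Editor's note on the log excerpt above: the line that matters is the "Unrecoverable error for result ..." entry, which names the failed result and its exit code. A minimal Python sketch for pulling those entries out of a saved message log follows. The pattern is written against the lines quoted in this thread; the function name is illustrative and not part of BOINC.

```python
import re

# Matches e.g. "Unrecoverable error for result NAME ( - exit code -5 (0xfffffffb))"
ERROR_PATTERN = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def find_unrecoverable_errors(lines):
    """Return (result_name, exit_code) pairs for every matching log line."""
    found = []
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            found.append((match["result"], int(match["code"])))
    return found


log = [
    "climateprediction.net - 2005-05-16 18:26:22 - Scheduler RPC to "
    "http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded",
    "climateprediction.net - 2005-05-16 18:26:55 - Unrecoverable error for result "
    "1yvn_300112795_0 ( - exit code -1073741819 (0xc0000005))",
]
print(find_unrecoverable_errors(log))
```

Because the pattern searches inside the line, it does not depend on how the timestamp and project name are delimited at the start.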
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 12632 - Posted: 16 May 2005, 22:02:37 UTC

Paul
Did you let BOINC do the dialing?

There was a bug since at least 4.05, which caused the cp programs to lock up when BOINC terminated the connection. I set my prefs to yes for 'Confirm before connecting to Internet?', and dial up myself.
4.25 doesn't dial up, but when I forgot to Disable BOINC Network Access, it attempted to contact the scheduler.

Perhaps there is still something wrong when it's allowed to do it by itself, this time crashing a model.

Les


ID: 12632
old_user67989

Joined: 30 Mar 05
Posts: 2
Credit: 366,631
RAC: 0
Message 12633 - Posted: 16 May 2005, 22:09:43 UTC

Les,

Yes, I let BOINC dial. I had been running it for a while with confirmation, but that seemed to cause the 'no finished file' problem and consequent loss of crunching time. I went through the phase 2 to 3 transition with the other WU just a few days ago, no problem. I agree though that the dial-up does seem slightly flaky at times.
ID: 12633
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12699 - Posted: 20 May 2005, 4:51:21 UTC - in response to Message 12626.  
Last modified: 20 May 2005, 5:08:30 UTC

> I had a backup [...] and crunched again the end of the phase without watching the globe: it
> worked fine and this Wu is now in phase 3.

I saw your result: it seems that the error-code has been reset to zero. My WU has now processed the end of phase 2 and this time it ended without crashing (phew!). The error-code however has not been reset, so the WU has outcome "Client error". Is this something that recovers later? Or is it useless for me to go on with this WU because it will be reassigned to someone else?

See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=440403

Bye for now,

Frederique
ID: 12699
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 12702 - Posted: 20 May 2005, 6:13:59 UTC - in response to Message 12699.  

> I saw your result: it seems that the error-code has been reset to zero. My WU
> has now processed the end of phase 2 and this time it ended without crashing
> (phew!). The error-code however has not been reset, so the WU has outcome
> "Client error". Is this something that recovers later? Or is it useless for
> me to go on with this WU because it will be reassigned to someone else?
>
> See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=440403

I'd continue with that WU, Frederique. Although your result is showing an outcome of 'Client error' the server will continue to accept trickles and the final upload from it, and the project team will make use of the result. The outcome and server state won't change (even when you've completed all 3 phases and uploaded the result), but that's just due to the database only accepting the first completion indication. The rescheduled result isn't going to conflict with yours as they have different result names (yours has a '_0' suffix, the new one has '_1'), and it's possible the other one will be allocated to a host that errors out very quickly.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 12702
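Editor's note on the result names mentioned above: each copy of a workunit gets its own result name, formed from the workunit name plus a numeric suffix ('_0' for the first copy, '_1' for the rescheduled one). A minimal Python sketch of splitting a result name follows. The naming rule is inferred from the names quoted in this thread, and the function name is illustrative.

```python
def split_result_name(result_name: str) -> tuple[str, int]:
    """Split 'WORKUNIT_N' into the workunit name and the copy number N."""
    workunit, _, suffix = result_name.rpartition("_")
    if not workunit or not suffix.isdigit():
        raise ValueError(f"unexpected result name: {result_name!r}")
    return workunit, int(suffix)


print(split_result_name("1d7y_000084449_0"))
print(split_result_name("38zl_200173143_1"))
```

Two results belong to the same workunit when the first element matches, which is why the '_0' and '_1' copies can both report to the server without overwriting each other.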
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12716 - Posted: 20 May 2005, 11:25:21 UTC

Okay, I'll continue with this WU, thank you for your response.

Bye,

Frederique
ID: 12716
old_user63187

Joined: 12 Mar 05
Posts: 10
Credit: 156,893
RAC: 0
Message 12754 - Posted: 21 May 2005, 8:42:05 UTC - in response to Message 12702.  

> it's possible the other one will be allocated to a host that errors out very quickly.

As if you could foresee the future: this is exactly what happened...

Thanks again,

Frederique
ID: 12754
old_user324

Joined: 6 Aug 04
Posts: 3
Credit: 1,735,317
RAC: 0
Message 13189 - Posted: 7 Jun 2005, 6:37:11 UTC

Another -5 exit :-(

06-06-2005 22:02:35||Suspending computation and network activity - running CPU benchmarks
06-06-2005 22:02:35|climateprediction.net|Pausing result 38zl_200173143_1 (removed from memory)
06-06-2005 22:02:37||Running CPU benchmarks
06-06-2005 22:02:45||Aborting CPU benchmarks, one or more active tasks are still running.
06-06-2005 22:02:45||Resuming computation and network activity
06-06-2005 22:02:53||request_reschedule_cpus: process exited
06-06-2005 22:02:53||schedule_cpus: must schedule
06-06-2005 22:02:53|climateprediction.net|Restarting result 38zl_200173143_1 using hadsm3 version 4.12
06-06-2005 22:10:12|ProteinPredictorAtHome|Deferring communication with project for 7 hours, 59 minutes, and 24 seconds
06-06-2005 22:33:21|Pirates@Home|Deferring communication with project for 4 hours, 57 minutes, and 18 seconds
06-06-2005 22:34:08|climateprediction.net|Unrecoverable error for result 38zl_200173143_1 ( - exit code -5 (0xfffffffb))
06-06-2005 22:34:08||request_reschedule_cpus: process exited
06-06-2005 22:34:08||schedule_cpus: must schedule
06-06-2005 22:34:08|climateprediction.net|Deferring communication with project for 59 seconds
06-06-2005 22:34:08|climateprediction.net|Computation for result 38zl_200173143_1 finished

ID: 13189
old_user63950

Joined: 16 Mar 05
Posts: 2
Credit: 107,907
RAC: 0
Message 13201 - Posted: 7 Jun 2005, 17:08:13 UTC
Last modified: 7 Jun 2005, 17:11:04 UTC

My model crashed at the end of Phase 3 - after the last trickle. CP reports:

4.19
process got signal 10

3
10

zip warning: Too many open files
zip warning: could not open for reading: 1eijba.ph33c10.x2.nc
zip warning: zip file empty
zip I/O error: Too many open files

zip error: Temporary file failure (zi48emQq)






:-( right at the bitter end....
ID: 13201
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 13202 - Posted: 7 Jun 2005, 17:31:45 UTC

Lupus
This is a common problem with Macs. See this thread (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2582), and my post at the bottom of this thread (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2659).

Les

ID: 13202
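Editor's note on the "Too many open files" message above: on Unix-like systems, including Mac OS X, that text usually means the process has reached its per-process limit on open file descriptors. Whether that limit is the cause of the CPDN zip failure is an assumption here; the thread does not confirm it. The Python sketch below only inspects the current limits using the standard `resource` module, which is not available on Windows; the function name is illustrative.

```python
import resource


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# resource.RLIM_INFINITY marks a limit that is effectively unlimited.
print(f"soft limit: {soft}, hard limit: {hard}")
```

A low soft limit would be consistent with a program failing when it tries to open many output files at once for zipping, but comparing the limit against what the model actually opens would be needed to confirm that.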
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 13207 - Posted: 7 Jun 2005, 20:27:59 UTC
Last modified: 7 Jun 2005, 20:29:44 UTC

I have given up on running CPDN on my PowerMac as the models are crashing almost as soon as I start ...

Of course, the good news is that I have bad examples now ...

Oh, Les ... can you e-mail me the log of that zip error ... I want to add those messages to the Wiki ... I hate to try to get the messages out of the forums as they never seem to get fixed to match real world when I try to do repairs by hand.

p.d.buck@comcast.net

Thanks...
ID: 13207
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 13208 - Posted: 7 Jun 2005, 21:15:37 UTC
Last modified: 7 Jun 2005, 22:03:03 UTC

Sorry Paul, I may have confused you. I run Win xp, not a Mac. Just an interested spectator with this problem.
Nice site you have, btw.

Les

edit
Actually, I can do a bit better. This (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=20179) is the account page for the Mac user that got me interested in the zip problem. He was right at the top of the '1 computer people on CPDN' list on BOINCstats before the XML stats got trashed.
All models completed with correct credit, but none with the magic word "Success". It looks as though he may have given up part way through the 'current' model.



ID: 13208
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 13230 - Posted: 8 Jun 2005, 11:38:28 UTC - in response to Message 13208.  

> Sorry Paul, I may have confused you. I run Win xp, not a Mac. Just an
> interested spectator with this problem.
> Nice site you have, btw.

Yes ... :)

Thanks! I am trying ... though most think I am very trying ... :)

I can't take full credit any more ... I have a few people doing work on it now. Plus, I got to cannibalize the official UCB site ... though I am still not quite done there ...

> All models completed with correct credit, but none with the magic word
> "Success". It looks as though he may have given up part way through the
> 'current' model.

Ah, well, for me, they don't run now. I don't know if it is because I am running "Tiger" or not. But I don't really have the time to chase the problem. To be honest, I don't recall ever finishing a model on the Mac. I have done quite a few on the PCs I run, so I would rather spend my time on models I am confident will run to the end and be useful.

If you have any interesting log files I would love the chance to look at them ... if you would not mind zipping them up and sending them to me ...
ID: 13230

©2024 cpdn.org