Model crashed just at the end of phase 2
Author | Message |
---|---|
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
My model crashed just at the end of phase 2 (message in the messages-tab: 16-5-2005 16:31:26|climateprediction.net|Unrecoverable error for result 1d7y_000084449_0 ( - exit code -1073741819 (0xc0000005))). I was watching the model via "show graphics" when the model was writing its data to the disk. I closed the window, and when I opened it again after a while the earth was completely blank (no colours). Shortly after that the model crashed. I (stupid me!) didn't copy the Windows message which appeared after that. Can the crash be related to my watching the model? Can I do anything to provide you with more data? The model was using version hadsm3 4.10. I was very disappointed that this happened after about 520 hours of work (since 23 March)... Bye for now, Frederique |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
Another question: is it possible to restart the model using the same WU? Greetings, Frederique |
Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275 |
> My model crashed just at the end of phase 2 (message in the messages-tab: 16-5-2005 16:31:26|climateprediction.net|Unrecoverable error for result 1d7y_000084449_0 ( - exit code -1073741819 (0xc0000005))). I was watching the model via "show graphics" when the model was writing its data to the disk. I closed the window, after a while I opened it again and the earth was completely blank (no colours). Shortly after that the model crashed. I (stupid me!) didn't copy the Windows-message which appeared after that. Welcome to the -1073741819 club. More and more people are getting these at the end of a phase. Did you have internet access connected at the time of failure? > Can the crash be related to me, watching the model? Can I do anything to provide you with more data? The model was using version hadsm3 4.10. Possibly a graphics problem, but that error is certainly not perfectly correlated with it. My errors (same error number) have been at end of phase with no graphics running. > I was very disappointed that this happened after about 520 hours of work (since 23 March)... Understandable disappointment. One of mine crashed at the end of phase 2 and the other at the end of phase 3. Both times the internet connection to the server was questionable. It shouldn't matter, but it may. Other suggestions that may or may not help... make sure you have all the latest drivers for your motherboard/graphics/network card/modem. Some references to that error number on the net suggest driver problems... but I think that is a shot in the dark here. As for saving the crashed model, the only way that could happen is if you had a recent backup of the boinc directory prior to the crash. In that case you could copy it back over the working directory and it would continue on. Of course if you are running other boinc projects as well, this would introduce other complications. |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
Thank you for your answer. I had internet access (it's an ADSL line), and I have no reason to believe that there was no internet access at that time, as I could download the next WU without any problems... Did you get these problems also with version 4.12 of hadsm3? Which version of the BOINC client did you use? (I use 4.25). What a pity at the end of phase 3, it must have been very frustrating for you... Bye for now, Frederique |
Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Hi, A few weeks ago, I had a crash when watching the globe while post-processing in phase 1. I had a backup (always do a backup before end of phase and disable network access) and crunched the end of the phase again without watching the globe: it worked fine and this WU is now in phase 3. I suspect an instability of the viz while the post-processing is under way: IMO, opening and closing the viz seems to crash the model. Arnaud |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
If it is any help: in the Event Viewer I saw the following event: ====================================================================== Event Type: Error Event Source: Application Error Event Category: None Event ID: 1000 Date: 16-5-2005 Time: 16:31:09 User: N/A Computer: NORROD Description: Faulting application hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, faulting module hadsm3_4.10_windows_intelx86.exe, version 0.0.0.0, fault address 0x00040e47. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp. ====================================================================== Arnaud, thank you for your tip: I'll try this (I have a backup from just 4 trickles ago)! |
Joined: 7 Aug 04 Posts: 2185 Credit: 64,822,615 RAC: 5,275 |
> Did you get these problems also with version 4.12 of hadsm3? Which version of the BOINC-client did you use? (I use 4.25). What a pity at the end of phase 3, it must have been very frustrating for you... This was with BOINC 4.25 and hadsm3 4.12. Yes, very frustrating. I wish I knew what was causing it. |
Joined: 30 Mar 05 Posts: 2 Credit: 366,631 RAC: 0 |
I had the same thing happen this evening. climateprediction.net - 2005-05-16 18:26:22 - Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded climateprediction.net - 2005-05-16 18:26:55 - Unrecoverable error for result 1yvn_300112795_0 ( - exit code -1073741819 (0xc0000005)) climateprediction.net - 2005-05-16 18:26:55 - Deferring communication with project for 1 minutes and 0 seconds climateprediction.net - 2005-05-16 18:26:55 - Computation for result 1yvn_300112795 finished I had 2 WUs running on a P4 HT as usual. One had recently completed phase 2 but I had network access disabled and dial-up offline so it had gone into phase 3 without uploading. I enabled network, it went through the dial-up sequence and the trickles etc went up. Then the WU bombed but left the other WU, already in phase 3, running. Boinc downloaded and started another WU immediately. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Paul, did you let BOINC do the dialing? There has been a bug since at least 4.05 which caused the cp programs to lock up when BOINC terminated the connection. I set my prefs to yes for 'Confirm before connecting to Internet?', and dial up myself. 4.25 doesn't dial up, but when I forgot to Disable BOINC Network Access, it attempted to contact the scheduler. Perhaps there is still something wrong when it's allowed to do it by itself, this time crashing a model. Les |
Joined: 30 Mar 05 Posts: 2 Credit: 366,631 RAC: 0 |
Les, yes I let boinc dial. I had been running it for a while with confirmation, but that seemed to cause the 'no finished file' problem and consequent loss of crunching time. I went through the phase 2 to 3 transition with the other WU just a few days ago, no problem. I agree though that the dial-up does seem slightly flaky at times. |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
> I had a backup [...] and crunched again the end of the phase without watching the globe: it worked fine and this Wu is now in phase 3. I saw your result: it seems that the error-code has been reset to zero. My WU has now processed the end of phase 2 and this time it ended without crashing (phew!). The error-code however has not been reset, so the WU has outcome "Client error". Is this something that recovers later? Or is it useless for me to go on with this WU because it will be reassigned to someone else? See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=440403 Bye for now, Frederique |
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
> I saw your result: it seems that the error-code has been reset to zero. My WU has now processed the end of phase 2 and this time it ended without crashing (phew!). The error-code however has not been reset, so the WU has outcome "Client error". Is this something that recovers later? Or is it useless for me to go on with this WU because it will be reassigned to someone else? > See http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=440403 I'd continue with that WU, Frederique. Although your result is showing an outcome of 'Client error', the server will continue to accept trickles and the final upload from it, and the project team will make use of the result. The outcome and server state won't change (even when you've completed all 3 phases and uploaded the result), but that's just due to the database only accepting the first completion indication. The rescheduled result isn't going to conflict with yours as they have different result names (yours has a '_0' suffix, the new one has '_1'), and it's possible the other one will be allocated to a host that errors out very quickly. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
Okay, I'll continue with this WU, thank you for your response. Bye, Frederique |
Joined: 12 Mar 05 Posts: 10 Credit: 156,893 RAC: 0 |
> it's possible the other one will be allocated to a host that errors out very quickly. As if you could foresee the future: this is exactly what happened... Thanks again, Frederique |
Joined: 6 Aug 04 Posts: 3 Credit: 1,735,317 RAC: 0 |
Another -5 exit :-( 06-06-2005 22:02:35||Suspending computation and network activity - running CPU benchmarks 06-06-2005 22:02:35|climateprediction.net|Pausing result 38zl_200173143_1 (removed from memory) 06-06-2005 22:02:37||Running CPU benchmarks 06-06-2005 22:02:45||Aborting CPU benchmarks, one or more active tasks are still running. 06-06-2005 22:02:45||Resuming computation and network activity 06-06-2005 22:02:53||request_reschedule_cpus: process exited 06-06-2005 22:02:53||schedule_cpus: must schedule 06-06-2005 22:02:53|climateprediction.net|Restarting result 38zl_200173143_1 using hadsm3 version 4.12 06-06-2005 22:10:12|ProteinPredictorAtHome|Deferring communication with project for 7 hours, 59 minutes, and 24 seconds 06-06-2005 22:33:21|Pirates@Home|Deferring communication with project for 4 hours, 57 minutes, and 18 seconds 06-06-2005 22:34:08|climateprediction.net|Unrecoverable error for result 38zl_200173143_1 ( - exit code -5 (0xfffffffb)) 06-06-2005 22:34:08||request_reschedule_cpus: process exited 06-06-2005 22:34:08||schedule_cpus: must schedule 06-06-2005 22:34:08|climateprediction.net|Deferring communication with project for 59 seconds 06-06-2005 22:34:08|climateprediction.net|Computation for result 38zl_200173143_1 finished |
Joined: 16 Mar 05 Posts: 2 Credit: 107,907 RAC: 0 |
My model crashed at the end of Phase 3 - after the last trickle. CP reports: 4.19 process got signal 10 3 10 zip warning: Too many open files zip warning: could not open for reading: 1eijba.ph33c10.x2.nc zip warning: zip file empty zip I/O error: Too many open files zip error: Temporary file failure (zi48emQq) :-( right at the bitter end.... |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Lupus, this is a common problem with Macs. See <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2582">this</a> thread, and my post at the bottom of <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2659">this</a> thread. Les |
Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0 |
I have given up on running CPDN on my PowerMac as the models are crashing almost as soon as I start ... Of course, the good news is that I have bad examples now ... Oh, Les ... can you e-mail me the log of that zip error ... I want to add those messages to the Wiki ... I hate to try to get the messages out of the forums as they never seem to get fixed to match real world when I try to do repairs by hand. p.d.buck@comcast.net Thanks... |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Sorry Paul, I may have confused you. I run Win XP, not a Mac. Just an interested spectator with this problem. Nice site you have, btw. Les edit Actually, I can do a bit better. <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=20179">This</a> is the account page for the Mac user that got me interested in the zip problem. He was right at the top of the '1 computer people on CPDN' list on BOINCstats before the xml stats got trashed. All models completed with correct credit, but none with the magic word "Success". It looks as though he may have given up part way through the 'current' model. |
Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0 |
> Sorry Paul, I may have confused you. I run Win xp, not a Mac. Just an interested spectator with this problem. Nice site you have, btw. Yes ... :) Thanks! I am trying ... though most think I am <b>very</b> trying ... :) I can't take full credit any more ... I have a few people doing work on it now. Plus, I got to cannibalize the official UCB site ... though I am still not quite done there ... > All models completed with correct credit, but none with the magic word "Success". It looks as though he may have given up part way through the 'current' model. Ah, well, for me, they don't run now. I don't know if it is because I am running "Tiger" or not. But, I don't really have the time to chase the problem. To be honest I don't recall ever finishing a model on the Mac. I have done quite a few on the PCs I run ... so, I would rather use my time on models I am confident will run to the end and be useful. If you have any interesting log files I would love the chance to look at them ... if you would not mind zipping them up and sending them to me ... |
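
For anyone puzzled by the two forms of the exit codes quoted in this thread: the messages tab prints the same 32-bit value twice, once as a signed decimal and once as unsigned hex. The Python sketch below shows that conversion. The function names and the one-entry lookup table are made up for illustration and are not part of BOINC; the only code named is 0xC0000005, which Windows documents as STATUS_ACCESS_VIOLATION. Nothing in the thread says what -5 means, so it is left as unknown.

```python
# Convert the signed exit code printed by the BOINC client into the
# unsigned 32-bit hex form that appears next to it in the messages tab.
def to_unsigned_hex(exit_code: int) -> str:
    # Masking with 0xFFFFFFFF reinterprets the value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Illustrative lookup only: a single well-documented Windows status code.
KNOWN_CODES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(exit_code: int) -> str:
    unsigned = exit_code & 0xFFFFFFFF
    name = KNOWN_CODES.get(unsigned, "unknown")
    return f"{exit_code} = {to_unsigned_hex(exit_code)} ({name})"


print(describe(-1073741819))  # -1073741819 = 0xc0000005 (STATUS_ACCESS_VIOLATION)
print(describe(-5))           # -5 = 0xfffffffb (unknown)
```

An access violation only says that the program touched memory it should not have; by itself it does not point to the graphics, the network, or a driver as the cause.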
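
The backup-and-restore approach described in this thread (copy the BOINC directory before the end of a phase, then copy it back over the working directory after a crash) could be scripted roughly as below. This is a minimal sketch and not an official CPDN or BOINC tool: the function names and the paths in the usage comment are hypothetical. It also assumes that BOINC has been shut down before either copy is made, which the thread does not say explicitly.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy boinc_dir into a new timestamped folder under backup_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    shutil.copytree(boinc_dir, target)  # fails if target already exists
    return target


def restore_boinc_dir(backup: Path, boinc_dir: Path) -> None:
    """Copy a backup back over the working directory, overwriting files."""
    # dirs_exist_ok requires Python 3.8+. Files that exist only in the
    # working directory are left in place; nothing is deleted.
    shutil.copytree(backup, boinc_dir, dirs_exist_ok=True)


# Hypothetical usage (adjust the paths to your own installation):
#   saved = backup_boinc_dir(Path(r"C:\Program Files\BOINC"), Path(r"D:\backups"))
#   ... after a crash, with BOINC stopped ...
#   restore_boinc_dir(saved, Path(r"C:\Program Files\BOINC"))
```

As noted earlier in the thread, restoring the whole directory also rolls back any other BOINC projects running from it, so this is simplest on a machine that runs only CPDN.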