hadsm3_4.10_windows_intelx86.exe has encountered a problem and needs to close.
Author | Message |
---|---|
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
"We are sorry for the inconvenience". Indeed! The WU has completed 99.99% (863:33:57 - 00:03:27 left to completion), and the application crashes on me! These are the details: AppName: hadsm3_4.10_windows_intelx86.exe AppVer: 0.0.0.0 ModName: ntdll.dll ModVer: 5.1.2600.2180 Offset: 00011f6e BOINC version was 4.36 |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
After I clicked on the "Don't Send" (to Microsoft) button, I got this additional information: forrtl: severe (24): end-of-file during read, unit 5, file C:\Program Files\BOINC\projects\climateprediction.net\3l9b_200189203\jobs\climate.cpdc Image ----------- PC ----- Routine Line -- Source hadsm3um_4.12_win 008C765B Unknown Unknown Unknown hadsm3um_4.12_win 008B132A Unknown Unknown Unknown hadsm3um_4.12_win 008B0039 Unknown Unknown Unknown hadsm3um_4.12_win 008B0564 Unknown Unknown Unknown hadsm3um_4.12_win 0089DFFB Unknown Unknown Unknown hadsm3um_4.12_win 0040790A Unknown Unknown Unknown kernel32.dll 7C816D4F Unknown Unknown Unknown It seems that the new WU crashed immediately after it started. I have a save point from the old WU at about 98%. If I restore that and install BOINC 4.43, is there a chance that I could successfully complete it? |
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
Aaargh! I had one of these about a month ago. Mine was actually in end of phase processing when it gave that error code (-1073741819). It looks like yours was too, as it uploaded the last trickle in the run. Extremely frustrating. I'd rerun the last 2% to see if you can get past end of phase this time. Make sure you are connected to the net when it is ready to communicate and upload, and perhaps pause the run and defrag your disk before it reaches end of phase. When it uploads correctly, if it does, it may not change the result status on your results page, but the data would be available to the investigators. Good luck! |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
Re-ran the last 2%, and found it had crashed again this morning. I didn't actually check the error code yesterday, but today I saw that it was indeed the dreaded -5 error again! Checked my statistics (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=30846) and found that out of 13 WUs only two had actually completed! This is really a very poor success rate. And the fact that the application has now crashed twice with the same data at the same processing point indicates to me that something is wrong with the application, and not with my hardware, as is so often stressed when the -5 error appears. The good news is that the next WU, which crashed yesterday with a severe forrtl error, is running fine now. |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
If it helps in any way to diagnose the problem, here are the messages from the period when the error occurred the second time. 2005-05-25 14:32:54||Starting BOINC client version 4.43 for windows_intelx86 2005-05-25 14:32:54||Data directory: C:\Program Files\BOINC 2005-05-25 14:32:54|climateprediction.net|Computer ID: 30846; location: ; project prefs: default 2005-05-25 14:32:54|SETI@home|Computer ID: 867134; location: work; project prefs: default 2005-05-25 14:32:54||General prefs: from SETI@home (last modified 2004-10-13 12:01:36) 2005-05-25 14:32:54||General prefs: no separate prefs for work; using your defaults 2005-05-25 14:32:54||Remote control not allowed; using loopback address 2005-05-25 14:33:10|climateprediction.net|Deferring computation for result 17lu_000077101_1 2005-05-25 14:33:10|SETI@home|Deferring computation for result 04fe05aa.10289.28994.267324.180_0 2005-05-25 14:33:10||Resuming computation and network activity 2005-05-25 14:33:10||schedule_cpus: must schedule 2005-05-25 14:33:11|climateprediction.net|Restarting result 17lu_000077101_1 using hadsm3 version 4.10 2005-05-25 15:33:10||schedule_cpus: time 3600.015625 2005-05-25 16:33:10||schedule_cpus: time 3600.055744 2005-05-25 17:33:10||schedule_cpus: time 3600.019131 2005-05-25 17:33:10|climateprediction.net|Pausing result 17lu_000077101_1 (removed from memory) 2005-05-25 17:33:11|SETI@home|Restarting result 04fe05aa.10289.28994.267324.180_0 using setiathome version 4.09 2005-05-25 17:33:14||request_reschedule_cpus: process exited 2005-05-25 17:33:14||schedule_cpus: must schedule 2005-05-25 18:33:14||schedule_cpus: time 3600.005159 2005-05-25 18:33:14|climateprediction.net|Restarting result 17lu_000077101_1 using hadsm3 version 4.10 2005-05-25 18:33:14|SETI@home|Pausing result 04fe05aa.10289.28994.267324.180_0 (removed from memory) 2005-05-25 18:33:15||request_reschedule_cpus: process exited 2005-05-25 18:33:15||schedule_cpus: must schedule 2005-05-25 19:33:15||schedule_cpus: time 3600.035610 
2005-05-25 20:33:15||schedule_cpus: time 3600.003874 2005-05-25 21:33:15||schedule_cpus: time 3600.015821 2005-05-25 21:33:15|climateprediction.net|Pausing result 17lu_000077101_1 (removed from memory) 2005-05-25 21:33:15|SETI@home|Restarting result 04fe05aa.10289.28994.267324.180_0 using setiathome version 4.09 2005-05-25 21:33:30||request_reschedule_cpus: process exited 2005-05-25 21:33:30||schedule_cpus: must schedule 2005-05-25 22:33:30||schedule_cpus: time 3600.020432 2005-05-25 22:33:30|climateprediction.net|Restarting result 17lu_000077101_1 using hadsm3 version 4.10 2005-05-25 22:33:30|SETI@home|Pausing result 04fe05aa.10289.28994.267324.180_0 (removed from memory) 2005-05-25 22:33:30||request_reschedule_cpus: process exited 2005-05-25 22:33:30||schedule_cpus: must schedule 2005-05-25 23:33:30||schedule_cpus: time 3600.031965 2005-05-26 00:33:30||schedule_cpus: time 3600.012373 2005-05-26 01:33:30||schedule_cpus: time 3600.078413 2005-05-26 01:33:30|climateprediction.net|Pausing result 17lu_000077101_1 (removed from memory) 2005-05-26 01:33:31|SETI@home|Restarting result 04fe05aa.10289.28994.267324.180_0 using setiathome version 4.09 2005-05-26 01:33:33||request_reschedule_cpus: process exited 2005-05-26 01:33:33||schedule_cpus: must schedule 2005-05-26 02:33:33||schedule_cpus: time 3600.027842 2005-05-26 02:33:33|climateprediction.net|Restarting result 17lu_000077101_1 using hadsm3 version 4.10 2005-05-26 02:33:33|SETI@home|Pausing result 04fe05aa.10289.28994.267324.180_0 (removed from memory) 2005-05-26 02:33:34||request_reschedule_cpus: process exited 2005-05-26 02:33:34||schedule_cpus: must schedule 2005-05-26 03:33:34||schedule_cpus: time 3600.012226 2005-05-26 04:33:34||schedule_cpus: time 3600.059076 2005-05-26 05:10:07|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:10:29|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi failed 
2005-05-26 05:10:29|climateprediction.net|No schedulers responded 2005-05-26 05:10:29|climateprediction.net|Deferring communication with project for 1 minutes and 0 seconds 2005-05-26 05:11:30|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:11:53|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi failed 2005-05-26 05:11:53|climateprediction.net|No schedulers responded 2005-05-26 05:11:53|climateprediction.net|Deferring communication with project for 1 minutes and 0 seconds 2005-05-26 05:12:54|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:13:16|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi failed 2005-05-26 05:13:16|climateprediction.net|No schedulers responded 2005-05-26 05:13:16|climateprediction.net|Deferring communication with project for 1 minutes and 0 seconds 2005-05-26 05:14:18|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:15:21|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded 2005-05-26 05:15:28|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:16:30|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded 2005-05-26 05:28:22|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2005-05-26 05:29:24|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded 2005-05-26 05:33:34||schedule_cpus: time 3600.074697 2005-05-26 05:33:34|climateprediction.net|Pausing result 17lu_000077101_1 (removed from memory) 2005-05-26 05:33:34|SETI@home|Restarting result 04fe05aa.10289.28994.267324.180_0 using 
setiathome version 4.09 2005-05-26 06:33:34||schedule_cpus: time 3600.106119 2005-05-26 06:33:34|SETI@home|Pausing result 04fe05aa.10289.28994.267324.180_0 (removed from memory) 2005-05-26 06:33:39||request_reschedule_cpus: process exited 2005-05-26 06:33:39||schedule_cpus: must schedule 2005-05-26 07:33:39||schedule_cpus: time 3600.026260 2005-05-26 08:33:39||schedule_cpus: time 3600.033154 2005-05-26 09:33:40||schedule_cpus: time 3600.011444 2005-05-26 09:33:40|climateprediction.net|Pausing result 17lu_000077101_1 (removed from memory) 2005-05-26 09:33:40|SETI@home|Restarting result 04fe05aa.10289.28994.267324.180_0 using setiathome version 4.09 2005-05-26 10:15:54|climateprediction.net|Unrecoverable error for result 17lu_000077101_1 ( - exit code -1073741819 (0xc0000005)) 2005-05-26 10:15:54||request_reschedule_cpus: process exited 2005-05-26 10:15:54|climateprediction.net|Deferring communication with project for 59 seconds 2005-05-26 10:15:54|climateprediction.net|Computation for result 17lu_000077101_1 finished 2005-05-26 10:15:54||schedule_cpus: must schedule |
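For readers decoding the exit code in the log above: the decimal value and the hexadecimal value in parentheses are the same 32-bit number, read as signed and unsigned respectively. A minimal Python sketch of the conversion follows; the function names are illustrative only. On Windows, 0xC0000005 is the status code for an access violation, which is presumably why posters in this thread shorten it to "the -5 error".

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def to_signed(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    print(to_unsigned_hex(-1073741819))  # 0xc0000005
    print(to_signed(0xC0000005))         # -1073741819
```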
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
If you have a backup from just before the crash, you can move the model to a different machine and try to finish the last few minutes there. CPDN crashed repeatedly on one of my machines on trickle 24 (end of phase1) with a lot of null bytes in the result files where data should be. Same exit code as yours. There is something wrong with the "big trickles" that doesn't happen on all machines - maybe even just sleeping a few microseconds between all those fopen / fclose things could fix it so the file buffers can be flushed properly. One of those I saved just seconds before the crash and moved it, it's a full run now, no trouble from that other machine :-) And before anyone tries to tell me that it's my machine : it's prime stable, has passed memtest, isn't OCed, doesn't have a heat problem (2 P3s Tualatin) and runs unattended for months without any flaws - so forget it, it isn't the machine. |
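The buffer-flushing idea in the post above can be sketched in Python. This only illustrates forcing written data out to disk before a file is closed; it is not the CPDN or hadsm3 code, and the helper name `write_and_sync` is hypothetical.

```python
import os
import tempfile


def write_and_sync(path: str, data: bytes) -> None:
    """Write data, flush Python's buffer, then ask the OS to commit it to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write its cache to the device


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "result.bin")
    write_and_sync(target, b"\x01\x02\x03")
    print(os.path.getsize(target))
```

An explicit flush and sync is usually a more dependable way to get buffers written than sleeping between open and close calls, at the cost of slower writes.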
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
> And before anyone tries to tell me that it's my machine : it's prime stable, > has passed memtest, isn't OCed, doesn't have a heat problem (2 P3s Tualatin) > and runs unattended for months without any flaws - so forget it, it isn't the > machine. > I don't think people are associating that error number much with unstable hardware. Too many stable systems are now getting this error, especially (but not exclusively) at phase end. Why some PCs appear more susceptible to that error, I don't know. The PC I am having this problem on is stable in all tests (ran long periods), including HD tests. It has to be some relatively recent thing in boinc, hadsm, or MS patches that only affects certain hardware or software configurations. Kind of wide open at that. But your idea about flushing buffers makes sense given it appears during periods of intense HD activity. |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
> If you have a backup from just before the crash, you can move the model to a > different machine and try to finish the last few minutes there. I do have a backup, and I do have another PC that could try to complete the job. However, I wouldn't know how to transfer that to another PC to do that. How would I go about that? Install BOINC first? Create a new host? Replace what data from the original host? |
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
> I do have a backup, and I do have another PC that could try to complete the > job. However, I wouldn't know how to transfer that to another PC to do that. > How would I go about that? > > Install BOINC first? > Create a new host? > Replace what data from the original host? > > This is how I would do it... 1. Copy the entire BOINC folder/directory structure that you had backed up (before the error) to the new hard drive in the same location (C:\BOINC or wherever it was). If you have to copy to CD/DVD to move it over, I would zip the entire folder (with subfolders) up first. Otherwise, copying the directory structure to CD/DVD and then copying it back to another hard drive will mark all the files read-only, which you don't want. 2. Run the installation of the same version of BOINC you were running before, and install it to the same directory you copied BOINC over to. 3. It should pick up from the point you had a backup. Good luck. |
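The zip-before-copying advice above can be scripted. The following is a minimal Python sketch, assuming the backup lives in a single folder; the function names and example paths are hypothetical. The first helper zips the whole folder tree, and the second clears the read-only flag in case the files were copied through a CD/DVD without zipping.

```python
import os
import shutil
import stat


def zip_boinc_folder(boinc_dir: str, archive_base: str) -> str:
    """Create archive_base + '.zip' containing the whole folder tree.

    Keep archive_base outside boinc_dir so the archive does not include itself.
    """
    full = os.path.abspath(boinc_dir)
    return shutil.make_archive(
        archive_base, "zip",
        root_dir=os.path.dirname(full),
        base_dir=os.path.basename(full),
    )


def clear_read_only(root: str) -> int:
    """Make every file under root writable again; return how many were changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if not mode & stat.S_IWRITE:
                os.chmod(path, mode | stat.S_IWRITE)
                changed += 1
    return changed
```

For example, `zip_boinc_folder(r"C:\BOINC", r"D:\boinc_backup")` would produce `D:\boinc_backup.zip` (both paths are placeholders). Stop BOINC before zipping so that no file changes halfway through.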
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
Thanks for the advice! |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
I have finally had the chance to load the saved WU to a new system, and run the last 2%. Result: the same - it crashes at 99.99%. I reloaded the saved result again and installed BOINC 4.43. Re-ran it again, with the same result: crash at 99.99%. It looks to me that this bug is highly reproducible. I am a software developer, and if I had a reproducible bug like this in my software, I would (probably) be able to debug and fix it. CPDN developers: are you interested in my saved, 98% complete WU? It runs just 7 hours until the crash. [Edited to add:] both original and new system run Win XP on Intel P4 processors (non-OC). |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
> It looks to me that this bug is highly reproducible. Yes. It seems to be a common experience that if you get it then it will probably recur, though it is always worth trying from a backup. Unfortunately, it may well happen on the next WU as well. Funny thing is, in my case it didn't begin immediately after I moved above BOINC 4.19, nor does it go away if I drop back to that version. Nor did it happen the first time I ran Hadsm3 4.12. But it has now affected both my P4 machines running Win XP, so the cause is common to both. Added to that is the evidence that the problem relates to an access error during file handling. My ignorance of these things is profound, but I suspect that it relates to my network security. I'm running off a Netgear router (using WiFi for one of the machines). I also have NAV. I've tried the obvious, but I'm at a loss to think what else could be affecting both PCs which have otherwise little in common. |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
> ... I suspect that it relates to my network security. The two systems I have used for my test are completely different. One is on a LAN, connecting to the Internet via proxy server, and using firewall and AV software. The other one is a standalone system, without firewall or AV, connecting to the Internet directly via ADSL modem. The HD was defragmented after the 98% completed WU was loaded, and nothing was running besides BOINC and CPDN. The only thing these systems have in common is that they are DELL Optiplex GX270 with Intel P4 processors, running Windows XP-SP2. (No overclocking, no hyperthreading). |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
> The two systems I have used for my test are completely different. > > The only thing these systems have in common are that they are DELL Optiplex > GX270 with Intel P4 processors, running Windows XP-SP2. (No overclocking, no > hyperthreading). Bang goes that theory. And mine use different hardware to this and from each other. Which gets us nearer to solving the puzzle that some people get this error, others reportedly don't. |
Send message Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0 |
Can you look at the files in the dataout folders to see if there are any oddities. ????aa are normally associated with .x1.nc ????ba are normally associated with .x2.nc Are there any files that do not follow this pattern? For example any ????aa.p*.x2.nc files, or any ????ba.p*.x3.nc files after the crash. (I seem to have 1 such file in all my runs: the ????aa.pc.8yac.x2.nc file. Presumably this is because the ????aa.pc.8yac file is not deleted after creating the ????aa.pc.8yac.x1.nc file. I have no idea if the phase transition might go more smoothly without this extraneous file. I also have a ka.pc.8yac.x1, ka.pc.8yac.x2 ka.pc.8yac.x3 and a ka.pc.8yac.x4 file in a completed SC run.) I think the instructions concerning the conversion of the pc.8yac files should be examined for bugs or at least to insert a delete. |
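The check described above can be automated. Here is a minimal Python sketch, assuming the naming rule is exactly as stated in the post (`????aa` files end in `.x1.nc`, `????ba` files end in `.x2.nc`); the function name and the sample file names are made up for illustration.

```python
import re

# Expected .xN.nc number for each stream suffix, as described in the post.
EXPECTED_X_NUMBER = {"aa": 1, "ba": 2}

# Four-character run id, then "aa" or "ba", then anything, then .x<N>.nc
NAME_RE = re.compile(r"^.{4}(aa|ba)\..+\.x(\d+)\.nc$")


def find_odd_files(filenames):
    """Return the .nc file names whose xN number does not match their stream."""
    odd = []
    for name in filenames:
        match = NAME_RE.match(name)
        if match and int(match.group(2)) != EXPECTED_X_NUMBER[match.group(1)]:
            odd.append(name)
    return odd


if __name__ == "__main__":
    sample = [
        "abcdaa.pc.8yac.x1.nc",  # follows the pattern
        "abcdaa.pc.8yac.x2.nc",  # aa file with x2
        "abcdba.pe.x3.nc",       # ba file with x3
    ]
    print(find_odd_files(sample))
```

To run it against a real model, pass `os.listdir(...)` of the dataout folder instead of the sample list.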
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've lost track of which / what problem is being pursued here now, but if it is the original one, I had one of these 'must close' things a month or so back. It was on a 4.12 model using BOINC 4.25. I Suspended BOINC, waited, Exited, waited, copied all the .xml files in BOINC to a 'save' directory I have for when I need to reboot, ticked 'Don't notify Microsoft', shut down the computer, waited a few seconds, and then powered up again. Windows started, a few progs appeared in the System Tray, I killed the AV and Spybot resident parts, started BOINC Manager, and in a few more seconds the two models were ticking away again. I don't know if this helps, or just irritates. |
Send message Joined: 17 Aug 04 Posts: 753 Credit: 9,804,700 RAC: 0 |
I could not see anything of the sort Crandles mentioned, but found some errors in stderr_um.txt. I've just quoted the end of the file: OPEN: File dataout/2qbhba.da40bp0 Created on Unit 22 OPEN: File dataout/2qbhba.da40bs0 Created on Unit 22 OPEN: File dataout/2qbhba.da40c10 Created on Unit 22 CLOSE: WARNING: Unit 60 Not Opened OPEN: File dataout/2qbhba.pa41c10 Created on Unit 60 CLOSE: WARNING: Unit 63 Not Opened OPEN: File dataout/2qbhba.pd41c10 Created on Unit 63 CLOSE: WARNING: Unit 64 Not Opened OPEN: File dataout/2qbhba.pe41c10 Created on Unit 64 CLOSE: WARNING: Unit 65 Not Opened OPEN: File dataout/2qbhba.pf41c10 Created on Unit 65 CLOSE: WARNING: Unit 66 Not Opened OPEN: File dataout/2qbhba.pg41c10 Created on Unit 66 CLOSE: WARNING: Unit 67 Not Opened OPEN: File dataout/2qbhba.ph41c10 Created on Unit 67 Again, pointing to a file access problem. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Although it looks strange, those messages are normal behaviour, Andrew. The warnings are generated by a bit of defensive coding and indicate that hadsm3um is trying to close files it hasn't opened yet. |
Send message Joined: 13 Sep 04 Posts: 228 Credit: 354,979 RAC: 0 |
> Checked my statistics (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=30846) > and found that out of 13 WUs only two had actually completed! This is really a very poor success rate... I am giving up on CPDN now - my last WU has also crashed with a -5 error near 30% completion. 12 out of 14 crashed - this failure rate is way too high. |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
> Although it looks strange those messages are normal behaviour Andrew. The > warnings are generated by a bit of defensive coding and indicate that hadsm3um > is trying to close files it hasn't opened yet. uh - sounds dangerous. If it has a file handle -1 or something like that and detects the open state from the handle **before** it tries a close(), everything is fine. If this is not the case and the error message comes from a close() call that fails, this can easily be the reason for those problems. As the handle might have been reassigned somewhere else for a different file, a file that still needs to be open (for the ZIP module for example) might get closed. There really seems to be a problem with the file handling in those end states of 24/48/72 so it would be a good idea to revise this part of the code. My guess has been too many open files but closing a file that is still needed might explain the problems too. |
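The "safe" variant described above, where the open state is checked before any close is attempted, might look like the following Python sketch. It is purely illustrative: the thread does not show how hadsm3um actually tracks its units, and the class and method names here are hypothetical.

```python
import os
import tempfile


class UnitTable:
    """Track which unit numbers are open so a close never touches a stale handle."""

    def __init__(self):
        self._open_units = {}  # unit number -> open file object

    def close_unit(self, unit: int) -> bool:
        """Close the unit if this table opened it; otherwise only warn."""
        handle = self._open_units.pop(unit, None)
        if handle is None:
            print(f"CLOSE: WARNING: Unit {unit} Not Opened")
            return False
        handle.close()
        return True

    def open_unit(self, unit: int, path: str) -> None:
        """Defensively close the unit first, then open a new file on it."""
        self.close_unit(unit)
        self._open_units[unit] = open(path, "w")
        print(f"OPEN: File {path} Created on Unit {unit}")


if __name__ == "__main__":
    table = UnitTable()
    table.open_unit(60, os.path.join(tempfile.mkdtemp(), "example.out"))
    table.close_unit(60)
```

Because the table is consulted before any real close happens, a unit that was never opened produces only a warning, and a handle that the operating system has since reused for another file is never closed by mistake.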