Ocean model crashed
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
This model crashed, reporting a client error. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13132040 I suspect this has been covered before, but I don't remember seeing it for this particular batch of models. My other task from the same batch is still running. One of the two other tasks from the work unit crashed at 0 time. This one was around the 50% mark. Despite the "client error" message, I suspect it is not my computer at fault. Dave |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
And the other full-resolution ocean model I was running has now crashed, albeit another 20-odd percent further through. I notice that none of the other tasks in the same work units completed either, but I did hope the one that kept going beyond 70% would complete. The result page for it is http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13122714 Dave |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Dave, I don't know if this is right, but the integer benchmark score is HUGE for that processor. Is it significantly overclocked? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks for that comment, which intrigues me. No, it is not significantly overclocked. The multiplier is standard; I haven't tried, but I don't think it is unlocked. [dave@localhost ~]$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz stepping : 10 cpu MHz : 2699.621 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority bogomips : 5399.24 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz stepping : 10 cpu MHz : 2699.621 cache size : 2048 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority bogomips : 5399.60 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: I can't think why the integer benchmark score should be HUGE when I haven't been playing around with overclocking. Dave |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Les. cpu-z doesn't seem to have a Linux version, so I have installed PerlMon, which claims to report the actual CPU frequency. It tells me 2699.814MHz. Since I have never increased the clock frequency on this box by more than 10Hz, that makes me even more curious as to why the integer benchmark should be "HUGE" for the processor I have. I am disinclined to believe that anyone who shares the house with me is fiddling, as neither has been converted to the power of the penguin. Dave |
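For anyone wanting to repeat this check without PerlMon, here is a minimal Python sketch. It compares the `cpu MHz` values in `/proc/cpuinfo` text with the nominal speed printed in the `model name` line. The function names and the 5% tolerance are illustrative assumptions, not anything from BOINC or CPDN. Note that `cpu MHz` is the current clock, so on a machine with frequency scaling it can read lower than nominal.

```python
import re

def clock_report(cpuinfo_text):
    """Return (nominal_mhz, [measured_mhz per core]) parsed from /proc/cpuinfo text."""
    # Nominal speed as advertised in the model name, e.g. "... E5400 @ 2.70GHz".
    nominal = re.search(r"@\s*([\d.]+)\s*GHz", cpuinfo_text)
    nominal_mhz = round(float(nominal.group(1)) * 1000, 3) if nominal else None
    # One "cpu MHz" line per logical processor.
    measured = [
        float(value)
        for value in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.MULTILINE)
    ]
    return nominal_mhz, measured

def looks_overclocked(cpuinfo_text, tolerance=0.05):
    """True if any core reports more than nominal * (1 + tolerance); None if data is missing."""
    nominal_mhz, measured = clock_report(cpuinfo_text)
    if nominal_mhz is None or not measured:
        return None
    return max(measured) > nominal_mhz * (1 + tolerance)

# Values copied from the cpuinfo output posted earlier in the thread.
sample = (
    "model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz\n"
    "cpu MHz : 2699.621\n"
    "model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz\n"
    "cpu MHz : 2699.621\n"
)
print(clock_report(sample))
print(looks_overclocked(sample))
```

On a live Linux box the same functions can be fed the contents of `/proc/cpuinfo` read with `open("/proc/cpuinfo").read()`.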
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Perhaps it's an overestimation from a new version of BOINC. I don't have 6.12.xx running on any of my Linux boxes, so I'm not sure whether the new version inflates it. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Could be. I never paid attention to the benchmark scores before upgrading to the latest BOINC, so I couldn't comment. It still leaves the question of why the tasks crashed. For one of the two tasks in question, none of the other tasks in the work unit completed. For the other, one completed and the other crashed. If there are likely to be any more of these models, should I remove them from my preferences or not? I normally back up about once a week but haven't yet tried restoring a crashed model. What happens if restoring is successful? Are the results accepted after the task is already showing "Error while computing" in the status column? Sorry if many of these questions have been answered a number of times already. Dave |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Yes, the results are accepted and the data is used by the scientists just like any other completed WU. The one thing that can be off-putting is that when a restored WU finishes and sends its results, a line will appear in Messages saying it was previously reported as an error. Just ignore this message, as it only applies to other projects and has no meaning in CPDN. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Jim, I will try and back up twice a week and in the event of another crash will try restore to see what happens. Dave |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
As you can see, task hadcm3n_yfi4_1900_40_007352950_0 crashed at the very end with the 4.zip file missing. 8/23/2011 4:29:03 AM climateprediction.net Generated new computer cross-project ID: ead5a08edbfa16e91d2b66991616c4ee 8/23/2011 4:29:04 AM climateprediction.net Computation for task hadcm3n_yfi4_1900_40_007352950_0 finished 8/23/2011 4:29:04 AM climateprediction.net Output file hadcm3n_yfi4_1900_40_007352950_0_4.zip for task hadcm3n_yfi4_1900_40_007352950_0 absent 8/23/2011 4:29:05 AM climateprediction.net Restarting task hadam3p_pnw_3204_1980_1_007395222_0 using hadam3p_pnw version 609 Could this have anything to do with the recent change in the server used? I do have a backup that I made only 6 hours before the model finished, so if a solution can be found I could restore and run it to the end again. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I restored task hadcm3n_yfi4_1900_40_007352950_0 from the 6-hour backup and ran it again. Same result. It crashes approx. 2 hours from the end. The stderr output is shown below. The line that looks significant to me is: 23:50:55 (4200): Can't acquire lockfile (32) - waiting 35s It shows up twice. Unless someone can come up with a correctable reason for the failure, I will delete the backup and go on. I hate to just write off 900 hours of crunching. <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> - exit code 193 (0xc1) </message> <stderr_txt> CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... 
CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... 02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s 02:11:15 (3840): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 23:50:55 (4200): Can't acquire lockfile (32) - waiting 35s 23:51:10 (7860): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Signal 11 received, exiting... Called boinc_finish </stderr_txt> ] |
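A long stderr dump like the one above is easier to read as counts per message type. The following is a minimal Python sketch that tallies the recurring messages by plain substring matching. The category names and the function name are illustrative assumptions; only the message strings themselves come from the log.

```python
# Substrings taken from the stderr output above; the keys are made-up labels.
PATTERNS = {
    "quit": "Quit request from BOINC",
    "suspend": "Suspend request from BOINC",
    "lockfile": "Can't acquire lockfile",
    "no_heartbeat": "No heartbeat from core client",
    "signal_11": "Signal 11 received",
}

def tally(stderr_text):
    """Count how often each known message occurs in the stderr text."""
    return {label: stderr_text.count(needle) for label, needle in PATTERNS.items()}

# A short excerpt in the same shape as the log above.
excerpt = (
    "CPDN Monitor - Quit request from BOINC... "
    "Suspended CPDN Monitor - Suspend request from BOINC... "
    "02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s "
    "02:11:15 (3840): No heartbeat from core client for 30 sec - exiting "
    "CPDN Monitor - No 'heartbeat' from BOINC... "
    "Signal 11 received, exiting... Called boinc_finish"
)
for label, count in tally(excerpt).items():
    print(f"{label}: {count}")
```

Pasting the full `<stderr_txt>` contents into `tally()` shows at a glance how many suspend and quit requests the task received before it failed.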
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The lock file problem could be several things, but it is most likely caused by the antivirus program scanning each file that becomes active to check it for viruses before allowing you to use it. Some AV programs are more aggressive than others in the way they work, which is why it has long been recommended to block AV scanning of the entire BOINC data section, both manually started scans AND scheduled scans. Backups: Here |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Still no closer to working out why my units crashed. I assume something may be in this from the errors but am afraid they mean nothing to me. I do know it isn't an antivirus program scanning anything in my case however. SIGABRT: abort called Stack trace (17 frames): ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x80b80df] [0xffffe400] [0xffffe430] /lib/libc.so.6(gsignal+0x51)[0xf7540ce1] /lib/libc.so.6(abort+0x182)[0xf7542632] /lib/libc.so.6(+0x65e4d)[0xf757ce4d] /lib/libc.so.6(+0x6bba1)[0xf7582ba1] /usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7763321] /usr/lib/libstdc++.so.6(_ZdaPv+0x1d)[0xf776337d] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a] /lib/libc.so.6(__libc_start_main+0xe6)[0xf752db96] ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(__gxx_personality_v0+0x169)[0x804cb51] Dave |
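The trace above can be made a little more readable by pulling out the frames that carry a symbol name. The Python sketch below does that with a regular expression and translates the two C++ mangled names that appear, using a small hand-written lookup (`_ZdlPv` and `_ZdaPv` are the g++ manglings of `operator delete(void*)` and `operator delete[](void*)`). The function name is an illustrative assumption and the lookup is not a general demangler.

```python
import re

# Hand-written lookup for the two mangled names in the trace above.
KNOWN_SYMBOLS = {
    "_ZdlPv": "operator delete(void*)",
    "_ZdaPv": "operator delete[](void*)",
}

def named_frames(trace_text):
    """Return symbol names that appear as (name+0xOFFSET), in stack order."""
    names = re.findall(r"\((\w+)\+0x[0-9a-fA-F]+\)", trace_text)
    return [KNOWN_SYMBOLS.get(name, name) for name in names]

# Frames copied from the trace above (binary path shortened for the example).
trace = (
    "hadcm3n_6.07_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x80b80df] "
    "[0xffffe400] [0xffffe430] "
    "/lib/libc.so.6(gsignal+0x51)[0xf7540ce1] "
    "/lib/libc.so.6(abort+0x182)[0xf7542632] "
    "/lib/libc.so.6(+0x65e4d)[0xf757ce4d] "
    "/lib/libc.so.6(+0x6bba1)[0xf7582ba1] "
    "/usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7763321] "
    "/usr/lib/libstdc++.so.6(_ZdaPv+0x1d)[0xf776337d] "
)
for name in named_frames(trace):
    print(name)
```

Read from the bottom up, the model called `operator delete[]`, and the C library then called `abort`. That pattern is commonly what glibc does when it detects a problem with the heap during a free, but this is an interpretation of the trace and nothing in the thread confirms it.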
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Dave, I see there's a large number of "Suspended CPDN monitor" messages -- your computer is throttling back the tasks quite frequently. In the past, people have noticed stability problems when CPDN tasks are starved of resources. Have you tried this:- In Boinc Manager - Advanced menu - Preferences: On the "processor usage" tab, set 'Use at most ... % of CPU' to 100.00, or at a minimum, 80.00. (If you're worried about heat, it'd be best to set "On multiprocessor systems, use at most 50.00 % of the processors", i.e. process one task at a time.) On the "disk and memory usage" tab, ensure "Leave applications in memory when suspended" is selected. Also on this tab, for two HadCM3Ns in 2GB, it'd be best to set the Memory Usage figure "Use at most ... % when computer is in use" to at least 80.00 % to be on the safe side. Likewise for "Use at most ... when computer is idle". If you've done all these and still have the problem, then it might be worth:- * having a good vacuum-out of the CPU's heat sink, and unseating and re-seating the RAM modules * running mprime or memtest86+ for 48 hours to check your computer's RAM * upgrading its power supply to a newer, name brand model. That last item helped me with a stability problem. Newer PSUs seem to reject power supply noise better than ones from a few years ago. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Greg, Only change in settings was to increase % memory usage when computer is in use. Shouldn't be much dust in system as it is fairly new. I will probably increase memory soon to 4GB which may make a difference. With the regional models I don't seem to have any problems. Dave |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just done a temperature check: with two regional models running it is 45C, dropping to 41C if I disable one of the tasks. Voltages seem to be stable too, with less than 0.1V change in either the 12V or 5V line when I stop a task. Dave |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
OK, Dave, that's good. Increasing the memory % might have done the trick. If not, the only thing left is to run mprime and memtest86+ for 24 - 48 hours each. You can't use the PC for anything else while memtest86+ is running, though. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Memory now up to 4GB. It is still going to be a long wait to see if the HADCM3 finishes or not. |
Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Yes, I estimate about 4 weeks with the PC running 24/7, less what has been done so far. Anyway, good luck! |