Message boards : Number crunching : Replanca Error/Sigseg fault.
Author | Message |
---|---|
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
So I am letting mine run - have suspended work ahead of them in the queue to try and help resolve this issue as quickly as possible. OK then, I will do the same and leave my two Linux machines to crunch 592s. One question being posed is whether it is the Natural Greenhouse Gas or other forcing files that are the issue. Can I check this and provide feedback? |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,365,622 RAC: 15,545 |
I've suspended some of mine to get a couple of 592 tasks to the front of the queue on my Win machine. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
On my 4-core 64-bit Xeon machine, I got my first segmentation fault in a long time.
wah2_sas50_namx_201612_8_592_011103518_0
Workunit: 11103518
Created: 21 Jun 2017, 17:06:08 UTC
Sent: 22 Jun 2017, 9:04:36 UTC
Report deadline: 4 Jun 2018, 14:24:36 UTC
Received: 23 Jun 2017, 0:39:42 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x0)
Computer ID: 1256552
Run time: 14 hours 1 min 12 sec
CPU time: 12 hours 46 min 6 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 1.28 GFLOPS
Application version: Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu
stderr out:
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x30ed26]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1]
Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=15331, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...19:46:08 (15331): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_8.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_restart.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
</message>
]]> |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Can I check this and provide feedback? All of batch 592 and two other batches that have this problem have natural GHG forcing. Hi all, I have asked whether there is an easy way to catch a task while it is running the first day of the new year, in the absence of the graphics that used to show the model time as well as the timestep, which would have made this easy. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
And with regards to the date the model is up to: In the working directory there is a file stdout_mon.txt; if you tail that file it will show the date that the model is up to. Entries in it will look something like: wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - 01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87 The “A” before the date corresponds to running the global model, and when it turns to “P” it is running the regional model. |
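For anyone who would rather script the check than read the file by eye, here is a minimal Python sketch that pulls the task name, timestep, model letter and model date out of one stdout_mon.txt entry. It assumes the entry layout is exactly as in the sample above and that dates are day-first (DD/MM/YYYY); the function name `parse_status` and the regular expression are illustrative, not anything shipped by the project.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample entry quoted in the thread:
#   <task> - PH <n> TS <timestep> <A|P> - DD/MM/YYYY HH:MM - H:M:S=... AVG= ... DLT= ...
STATUS_RE = re.compile(
    r"(?P<task>\S+)\s+-\s+PH\s+(?P<phase>\d+)\s+TS\s+(?P<timestep>\d+)\s+"
    r"(?P<model>[AP])\s+-\s+(?P<day>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2})"
)

# "A" = global model, "P" = regional model, as described in the post above.
MODEL_NAMES = {"A": "global", "P": "regional"}


def parse_status(line):
    """Return a dict describing one status entry, or None if the line is not one."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    # Day-first date parsing is an assumption based on the dates quoted in the thread.
    model_time = datetime.strptime(
        f"{match['day']} {match['time']}", "%d/%m/%Y %H:%M"
    )
    return {
        "task": match["task"],
        "timestep": int(match["timestep"]),
        "model": MODEL_NAMES[match["model"]],
        "model_time": model_time,
    }


sample = (
    "wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - "
    "01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87"
)
info = parse_status(sample)
print(info["task"], info["model"], info["timestep"], info["model_time"])
```

Lines that are not status entries (for example "Model crash detected, will try to restart...") simply return None, so the function can be applied to every line of the file.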
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Though when I use the -F -s60 option I get an "Unable to follow end of this type of file" message. I am sure there will be a way around this, but I haven't looked deeply enough into the tail command yet to find it. A single-shot sudo tail filename works fine to see where the task is up to. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
And with regards to the date the model is up to: So if I understood correctly, I could monitor this file and, if and when a WU fails, post the last line back here before the project clears up the directories. Two more failed on Linux - this one and this one. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Though when I use the -F -s60 option I get an "Unable to follow end of this type of file" message Just use the -f option and it will sit there and scroll in the terminal window with each timestep. All of mine that have failed with this SIGSEGV error failed on the first timestep of the regional model on Jan 1st. |
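If tail itself is awkward on a given machine, the single-shot check can be approximated in a few lines of Python. This is only a sketch: `last_lines` is a made-up helper name, and the demonstration below runs against a throwaway file rather than a real stdout_mon.txt, whose location in the task's working directory will differ per host.

```python
import os
import tempfile
from collections import deque


def last_lines(path, count=1):
    """Return the final `count` lines of a text file, like a one-shot tail."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


# Demonstration on a temporary file; on a real host, pass the path to
# stdout_mon.txt in the task's working directory instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stdout_mon.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("first entry\nsecond entry\nthird entry\n")
    print(last_lines(demo_path))
    print(last_lines(demo_path, 2))
```

This reads the whole file once per call, which should be fine for an occasional look but is not a replacement for tail -f when watching every timestep.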
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
It looks like the third one of mine also crashed on the first timestep of the regional model on Jan 1st:
wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011616 A - 02/01/2017 00:00 - H:M:S=0014:39:41 AVG= 4.54 DLT= 3.70
wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:39:50 AVG= 4.54 DLT= 8.78
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Uploading out files...
Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip
|
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Thanks George. Currently at 30-12-2016 12:15, so not long to go on the one I am monitoring. Though that was the global model; it is now on the regional bit of the day and up to 18:20. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
And all three failed during the first timestep of the regional bit of the first day of 2017. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,365,622 RAC: 15,545 |
wah2_sas50_nc8r_201612_8_592_011105600_0 failed after 1 trickle on Win. Have got 4 others running and a couple of others in the queue. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
Unfortunately I wasn't able to suspend them as suggested and all 3 failed:
wah2_sas50_ncpy_201612_8_592_011106219 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:10:02 AVG= 4.39 DLT= 8.74
Model crash detected, will try to restart...
wah2_sas50_n85n_201612_8_592_011100304 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:13:27 AVG= 3.79 DLT= 7.10
Model crash detected, will try to restart...
wah2_sas50_ncmn_201612_8_592_011106100 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:14:34 AVG= 3.79 DLT= 7.42
Model crash detected, will try to restart...
I have a few more but will not be around to monitor and suspend them, so they will most probably fail as well. |
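When several tasks fail, the comparison being made in these posts (same timestep, same model, same model date just before the crash message) can be done mechanically. The Python sketch below is one way to do it, assuming the entry layout quoted in this thread and that the text "Model crash detected" follows the last status entry written before each crash; `last_status_before_crashes` is a hypothetical helper name.

```python
import re

# Assumed entry layout, as quoted in the posts above.
STATUS_RE = re.compile(
    r"(?P<task>\S+)\s+-\s+PH\s+\d+\s+TS\s+(?P<timestep>\d+)\s+"
    r"(?P<model>[AP])\s+-\s+(?P<day>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2})"
)
CRASH_MARKER = "Model crash detected"


def last_status_before_crashes(log_text):
    """For each crash marker, return the last status entry that precedes it."""
    results = []
    segments = log_text.split(CRASH_MARKER)
    # The final segment comes after the last marker, so no crash ends it.
    for segment in segments[:-1]:
        matches = list(STATUS_RE.finditer(segment))
        if matches:
            last = matches[-1]
            results.append(
                (
                    last["task"],
                    int(last["timestep"]),
                    last["model"],
                    f"{last['day']} {last['time']}",
                )
            )
    return results


log_text = """\
wah2_sas50_ncpy_201612_8_592_011106219 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:10:02 AVG= 4.39 DLT= 8.74
Model crash detected, will try to restart...
wah2_sas50_n85n_201612_8_592_011100304 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:13:27 AVG= 3.79 DLT= 7.10
Model crash detected, will try to restart...
wah2_sas50_ncmn_201612_8_592_011106100 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:14:34 AVG= 3.79 DLT= 7.42
Model crash detected, will try to restart...
"""
for task, timestep, model, model_time in last_status_before_crashes(log_text):
    print(task, timestep, model, model_time)
```

For the three entries posted here, every crash follows a "P" (regional) entry at timestep 11617 and model time 01/01/2017 00:05, which matches what people are reporting by eye.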
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Unfortunately I wasn't able to suspend them as suggested and all 3 failed I was able to suspend two out of three. Because the third had completed the same percentage before the crash, I am assuming it fell over at the exact same point. Project people believe they are getting closer to identifying the problem but are not there yet. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,365,622 RAC: 15,545 |
wah2_sas50_n8z3_201612_8_592_011101364_0 and wah2_sas50_n8kf_201612_8_592_011100836_0 both up to t/s 46,379 if this info helps. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,884,997 RAC: 4,577 |
Two SAS50/8 from batch #592 have failed on my Mac at the same point and before sending the first Zip. Two models from batch #592 have completed successfully on my Windows machine. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion. I have 4 running on a win machine and they seem fine. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion. Seems pretty universal, even if suspended during first day. Project have been advised. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
I have 3 WUs from batch 617 that failed on my Linux box. Two failed with SIGSEGV: segmentation violation after 14 h: https://www.cpdn.org/cpdnboinc/result.php?resultid=20564889 https://www.cpdn.org/cpdnboinc/result.php?resultid=20566748 The third one crashed at the 8th minute with Model crashed: Leaving CPDN_ain::Monitor... Calling boinc_finish...09:30:49 (16432): called boinc_finish(0) In boinc_exit called with status 0 Calloing set_signal_exit_code with status 0 EDIT: The third one seems to be fine on Windows, as it has already produced 3 trickles. I have a few of that batch on two Linux machines. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I have a retread on Linux that has already failed once on Darwin with a sigseg fault after about 8 hours. I have moved it to the top of the queue to see what happens. I should say that it is looking likely that batches where a significant number of tasks fall over are not going to be uncommon. The restart files from these batches will often form the basis for a follow-up batch which, because the initial conditions have not forced it into an impossible climate etc., will have a much higher success rate. |