Thread 'Replanca Error/Sigseg fault.'

Author	Message
bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56439 - Posted: 22 Jun 2017, 21:00:07 UTC - in response to Message 56438. So I am letting mine run - have suspended work ahead of them in the queue to try and help resolve this issue as quickly as possible. OK, then I will do the same and leave my two Linux machines to crunch 592s. One question being posed is whether it is the Natural Greenhouse Gas or other forcing files that are the issue. Can I check this and provide feedback? ID: 56439 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 493 Credit: 31,669,049 RAC: 10,904	Message 56441 - Posted: 22 Jun 2017, 22:29:02 UTC - in response to Message 56438. I've suspended some of mine to get a couple of 592 tasks to the front of the queue on my Win machine. ID: 56441 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 56443 - Posted: 23 Jun 2017, 2:20:52 UTC On my 4-core 64-bit Xeon machine, I got my first segmentation fault in a long time. wah2_sas50_namx_201612_8_592_011103518_0 Workunit 11103518 Created 21 Jun 2017, 17:06:08 UTC Sent 22 Jun 2017, 9:04:36 UTC Report deadline 4 Jun 2018, 14:24:36 UTC Received 23 Jun 2017, 0:39:42 UTC Server state Over Outcome Computation error Client state Compute error Exit status 0 (0x0) Computer ID 1256552 Run time 14 hours 1 min 12 sec CPU time 12 hours 46 min 6 sec Validate state Invalid Credit 0.00 Device peak FLOPS 1.28 GFLOPS Application version Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu stderr out <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> SIGSEGV: segmentation violation Stack trace (13 frames): /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357] [0x55555400] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d] /lib/libc.so.6(__libc_start_main+0xe6)[0x30ed26] /home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1] Exiting... 
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=15331, iMonCtr=2 Model crash detected, will try to restart... Leaving CPDN_ain::Monitor... Calling boinc_finish...19:46:08 (15331): called boinc_finish(0) In boinc_exit called with status 0 Calloing set_signal_exit_code with status 0 </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_8.zip</file_name> <error_code>-161</error_code> </file_xfer_error> <file_xfer_error> <file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_restart.zip</file_name> <error_code>-161</error_code> </file_xfer_error> </message> ]]> ID: 56443 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56444 - Posted: 23 Jun 2017, 6:51:56 UTC Can I check this and provide feedback? All of batch 592 and two other batches that have this problem have natural GHG forcing. Hi all, I think from what you have been saying that these failures are seeming to affect the “Natural” batches and not the “Actual” ones as it seems as if batches 589 and 591 were ok and batches 590, 592 and 583 which are all “Natural” forcing batches. We have tried in a local run swapping the SST and Sea Ice fields and get the same answer so we are wondering if it could be an issue with the GHG forcing (which is updated once a year) or other natural forcing files that are causing the issue. As I say we are actively trying local tests at the moment to try and work out what is happening here. Interestingly in one of the local runs that we did leaving the working directories in place that failed then continued to run to completion (so would have restarted running day 1 of the year in the global model again and going on to the regional model). Therefore you may find (if you happen to catch it) that if you suspend the job while it is running the first day of the new year in the global model and then restart it that it will then run to completion. This sort of error is making us think that it could be a memory issue somewhere… As I say any info gratefully received here on this! Best wishes, Sarah I have asked about an easy way to tell about catching job while running first day of new year in the absence of the graphics that used to tell us the model time as well as the timestep that would have made this easy. ID: 56444 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56445 - Posted: 23 Jun 2017, 8:08:22 UTC - in response to Message 56444. And with regards to the date the model is up to: In the working directory there is a file stdout_mon.txt; if you tail that file it will say the date that the model is up to. Entries in it will look something like: wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - 01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87 The “A” before the date corresponds to running the global model and when it turns to “P” then it is running the regional model. ID: 56445 · Reply Quote
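For anyone who would rather read these entries with a script than by eye, here is a minimal Python sketch that pulls the fields out of one stdout_mon.txt line. It assumes the layout is exactly as in the sample entry above, that TS is the timestep counter, and that the date is day/month/year; the function name parse_status_line and the dictionary keys are made up for illustration and are not part of any CPDN tool.

```python
import re
from datetime import datetime

# Pattern for one status entry, based only on the sample line in this thread.
# Assumption: "TS" is the timestep counter and the date is dd/mm/yyyy.
STATUS_RE = re.compile(
    r"^(?P<task>\S+) - PH\s+(?P<phase>\d+)\s+TS\s+(?P<timestep>\d+)\s+"
    r"(?P<flag>[AP]) - (?P<date>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) - "
    r"H:M:S=(?P<elapsed>\d+:\d{2}:\d{2})"
)

# "A" = global model, "P" = regional model (as described in the post above).
MODEL_NAMES = {"A": "global", "P": "regional"}


def parse_status_line(line):
    """Return a dict of fields for a status entry, or None for any other line."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    return {
        "task": match.group("task"),
        "timestep": int(match.group("timestep")),
        "model": MODEL_NAMES[match.group("flag")],
        "model_time": datetime.strptime(match.group("date"), "%d/%m/%Y %H:%M"),
        "elapsed": match.group("elapsed"),
    }


sample = (
    "wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - "
    "01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87"
)
info = parse_status_line(sample)
print(info["model"], info["timestep"], info["model_time"])
```

Lines that are not status entries (for example "Model crash detected, will try to restart...") simply come back as None, so the same function can be run over the whole file.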

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56446 - Posted: 23 Jun 2017, 9:41:25 UTC - in response to Message 56445. Though when I use the -F -s60 option I get an "Unable to follow end of this type of file" message. I am sure there will be a way around this but I haven't looked deeply enough into the tail command yet to find it. Running sudo tail filename works fine as a single shot to see where the task is up to. ID: 56446 · Reply Quote

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56447 - Posted: 23 Jun 2017, 11:07:45 UTC - in response to Message 56445. Last modified: 23 Jun 2017, 11:10:01 UTC And with regards to the date the model is up to: In the working directory there is a file stdout_mon.txt; if you tail that file it will say the date that the model is up to. Entries in it will look something like: wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - 01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87 The “A” before the date corresponds to running the global model and when it turns to “P” then it is running the regional model. So if I understood correctly, I could monitor this file and, once/if a WU fails, post the last line back here before the project clears up the directories. Two more failed on Linux - this one and this one ID: 56447 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 56448 - Posted: 23 Jun 2017, 13:17:49 UTC - in response to Message 56446. Last modified: 23 Jun 2017, 13:19:08 UTC Though when I use the -F -s60 option I get an "Unable to follow end of this type of file" message. I am sure there will be a way around this but I haven't looked deeply enough into the tail command yet to find it. Running sudo tail filename works fine as a single shot to see where the task is up to. Just use the -f option and it will sit there and scroll in the terminal window with each timestep. All of mine that have failed with this sigsegv error failed on the first timestep of the regional model on Jan 1st. ID: 56448 · Reply Quote
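If tail -f is not convenient, the same watching can be roughly sketched in Python. The block below is only an illustration under assumptions: the "A"/"P" flag sits right after the TS number as in the entries quoted in this thread, the helper names (model_flag, first_regional_line, follow) are invented here, and follow is a simple polling loop, not a replacement for tail. The sample entries are of the kind volunteers have reported from stdout_mon.txt.

```python
import os
import re
import time

# Assumption: the model flag ("A" global, "P" regional) follows the TS number.
FLAG_RE = re.compile(r" TS\s+\d+\s+([AP]) - ")


def model_flag(line):
    """Return 'A', 'P', or None if the line is not a status entry."""
    match = FLAG_RE.search(line)
    return match.group(1) if match else None


def first_regional_line(lines):
    """Return the first 'P' entry that comes after at least one 'A' entry."""
    seen_global = False
    for line in lines:
        flag = model_flag(line)
        if flag == "A":
            seen_global = True
        elif flag == "P" and seen_global:
            return line
    return None


def follow(path, poll_seconds=1.0):
    """Yield lines appended to a file, polling in the style of tail -f."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, os.SEEK_END)  # start at the current end of the file
        while True:
            line = handle.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)


# Sample entries in the format seen in stdout_mon.txt.
sample = [
    "wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011616 A - "
    "02/01/2017 00:00 - H:M:S=0014:39:41 AVG= 4.54 DLT= 3.70",
    "wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011617 P - "
    "01/01/2017 00:05 - H:M:S=0014:39:50 AVG= 4.54 DLT= 8.78",
]
switch_line = first_regional_line(sample)
print(switch_line)
```

To watch a live task, something like first_regional_line(follow("stdout_mon.txt")) run from the task's working directory would block until the regional model starts; note that follow never returns on its own, so stop it with Ctrl-C.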

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56449 - Posted: 23 Jun 2017, 14:29:14 UTC Last modified: 23 Jun 2017, 14:46:56 UTC It looks like the third one of mine also crashed on the first timestep of the regional model on Jan 1st: wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011615 A - 01/01/2017 23:45 - H:M:S=0014:39:38 AVG= 4.54 DLT= 3.72 wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011616 A - 02/01/2017 00:00 - H:M:S=0014:39:41 AVG= 4.54 DLT= 3.70 wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:39:50 AVG= 4.54 DLT= 8.78 Model crash detected, will try to restart... Leaving CPDN_Main::Monitor... Uploading out files... Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip The 4th one I did not trace. I'm tracing 3 more. ID: 56449 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56450 - Posted: 23 Jun 2017, 15:27:47 UTC - in response to Message 56449. Last modified: 23 Jun 2017, 15:58:24 UTC Thanks George, currently 30-12-2016 12:15 so not long to go on the one I am monitoring. Though that was the global model; it is now on the regional bit of the day and up to 18:20. ID: 56450 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56452 - Posted: 23 Jun 2017, 19:27:34 UTC - in response to Message 56450. And all three failed during the first timestep of the regional bit of the first day of 2017. ID: 56452 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 493 Credit: 31,669,049 RAC: 10,904	Message 56453 - Posted: 23 Jun 2017, 23:17:58 UTC - in response to Message 56452. wah2_sas50_nc8r_201612_8_592_011105600_0 failed after 1 trickle on Win. Have got 4 others running and a couple of others in the queue. ID: 56453 · Reply Quote

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56454 - Posted: 24 Jun 2017, 4:21:09 UTC - in response to Message 56449. Unfortunately I wasn't able to suspend them as suggested and all 3 failed: wah2_sas50_ncpy_201612_8_592_011106219 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:10:02 AVG= 4.39 DLT= 8.74 Model crash detected, will try to restart... wah2_sas50_n85n_201612_8_592_011100304 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:13:27 AVG= 3.79 DLT= 7.10 Model crash detected, will try to restart... wah2_sas50_ncmn_201612_8_592_011106100 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:14:34 AVG= 3.79 DLT= 7.42 Model crash detected, will try to restart... I have a few more but will not be around to monitor and suspend them, so they will most probably fail as well. ID: 56454 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56455 - Posted: 24 Jun 2017, 5:00:15 UTC - in response to Message 56454. Unfortunately I wasn't able to suspend them as suggested and all 3 failed I was able to suspend two out of three. Because the third had completed the same percentage before the crash, I am assuming it fell over at the exact same point. Project people believe they are getting closer to identifying the problem but are not there yet. ID: 56455 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 493 Credit: 31,669,049 RAC: 10,904	Message 56458 - Posted: 24 Jun 2017, 9:21:17 UTC - in response to Message 56453. wah2_sas50_n8z3_201612_8_592_011101364_0 and wah2_sas50_n8kf_201612_8_592_011100836_0 both up to t/s 46,379 if this info helps. ID: 56458 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,944,701 RAC: 2,164	Message 56469 - Posted: 26 Jun 2017, 11:29:37 UTC Two SAS50/8 from batch #592 have failed on my Mac at the same point and before sending the first Zip. Two models from batch #592 have completed successfully on my Windows machine. ID: 56469 · Reply Quote

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56471 - Posted: 26 Jun 2017, 15:25:12 UTC - in response to Message 56454. All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion. I have 4 running on a win machine and they seem fine. ID: 56471 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56472 - Posted: 26 Jun 2017, 17:24:18 UTC - in response to Message 56471. All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion. Seems pretty universal, even if suspended during first day. Project have been advised. ID: 56472 · Reply Quote

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904	Message 56572 - Posted: 27 Jul 2017, 8:28:57 UTC Last modified: 27 Jul 2017, 9:08:17 UTC I have 3 WUs from batch 617 that failed on my Linux box. Two failed with SIGSEGV: segmentation violation after 14 h, https://www.cpdn.org/cpdnboinc/result.php?resultid=20564889 https://www.cpdn.org/cpdnboinc/result.php?resultid=20566748 and the third one crashed at the 8-minute mark with Model crashed: Leaving CPDN_ain::Monitor... Calling boinc_finish...09:30:49 (16432): called boinc_finish(0) In boinc_exit called with status 0 Calloing set_signal_exit_code with status 0 EDIT: The third one seems to be fine on Windows as it has produced 3 trickles already. I have a few of that batch on two Linux machines. ID: 56572 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 56573 - Posted: 27 Jul 2017, 9:56:47 UTC - in response to Message 56572. Last modified: 27 Jul 2017, 10:01:30 UTC I have a retread on Linux that has already failed once on Darwin with a sigseg fault after about 8 hours. I have moved it to the top of the queue to see what happens. I should say that it is looking likely that batches where a significant number of tasks fall over are not going to be uncommon. The restart files from these batches will often form the basis for a follow-up batch which, because the initial conditions have not forced it into an impossible climate etc., will have a much higher success rate. ID: 56573 · Reply Quote