Thread 'Replanca Error/Sigseg fault.'

bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56406 - Posted: 21 Jun 2017, 5:12:58 UTC
Last modified: 21 Jun 2017, 5:16:27 UTC

Two wah2_sas50_... _8_590 tasks exited with SIGSEGV (segmentation violation) after almost 14 h of crunching. Their running times differed by less than 10 minutes.
This one is now being crunched under Apple Darwin, and the second one on Windows machines.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56407 - Posted: 21 Jun 2017, 8:52:17 UTC - in response to Message 56406.  
Last modified: 21 Jun 2017, 9:05:23 UTC

My understanding is that not all of this batch is expected to fail under Linux and/or Darwin. I have suspended some of my other tasks, so I now have three running to see what happens.
Edit:
This one has got past the fourth month on Darwin, so unless the problem is only on Linux this time, with luck at least some will finish on non-Windows machines.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56410 - Posted: 21 Jun 2017, 14:49:27 UTC - in response to Message 56407.  

I did something funny when posting the link; it should be:

https://www.cpdn.org/cpdnboinc/result.php?resultid=20502269

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56412 - Posted: 21 Jun 2017, 15:35:13 UTC - in response to Message 56410.  

I have been digging around a bit and could not find any Linux boxes that have returned zips. There is a high failure rate with Darwin, though I did find one that has returned 4 zips. There is also a higher than usual failure rate with Windows on those I looked at, though quite a few have returned four or five zips.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56415 - Posted: 21 Jun 2017, 16:38:40 UTC

So far, on Linux, all my 589 tasks have been running fine through several model months. The two 590 tasks crashed after one model month, on January 1st, as the regional part of the model started running.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56416 - Posted: 21 Jun 2017, 16:41:12 UTC - in response to Message 56407.  
Last modified: 21 Jun 2017, 16:48:15 UTC

Got two more from batch 590 that failed on my Linux box after 13 h 52-54 min; it looks like they failed close to the same spot as the other two. I have a few more queued (suspended for now) and two running. Waiting for advice on whether I should abort them.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56418 - Posted: 21 Jun 2017, 17:59:49 UTC - in response to Message 56416.  

Please abort any 590 tasks on Linux. I am awaiting clarification with regard to other platforms.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56420 - Posted: 21 Jun 2017, 18:09:31 UTC - in response to Message 56418.  

Thanks Dave, all 590 tasks under Linux are aborted. Two 591 tasks have started; I hope they will be OK. The 589 tasks are at 28%.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56421 - Posted: 21 Jun 2017, 18:14:58 UTC - in response to Message 56420.  

"Yes that batch was released as wah2_ri incorrectly, but will be the same app code that it is running as we have recently made the wah2 and wah2_ri code the same. There was an error with the template though and these will fail at final upload (so feel free to abort workunits from batches 590 and 589)."


Looks like 589 should be aborted as well.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56423 - Posted: 21 Jun 2017, 18:41:22 UTC - in response to Message 56421.  

In fact everyone should abort 589 and 590. They have been replaced with 591 and 592 which are out there to be picked up.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56424 - Posted: 21 Jun 2017, 19:56:42 UTC - in response to Message 56423.  

Dave Jackson wrote: "In fact everyone should abort 589 and 590. They have been replaced with 591 and 592 which are out there to be picked up."


Does it refer to ALL operating systems?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56426 - Posted: 21 Jun 2017, 20:40:19 UTC

bernard_ivo wrote: "Does it refer to ALL operating systems?"


Yes, because of an incorrect header they will fail at the end even if they get that far.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56427 - Posted: 22 Jun 2017, 6:00:16 UTC - in response to Message 56426.  

I presume this is related to the tasks being listed as Weather At Home 2 (wah2) (region independent) v8.25 i686-pc-linux-gnu, despite showing as the "ordinary" WAH2 when they were in the tasks ready to send table on the server status page. One of us should really have spotted that sooner.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56428 - Posted: 22 Jun 2017, 9:55:48 UTC
Last modified: 22 Jun 2017, 9:56:09 UTC

This Linux WU from batch 592 failed on someone else's Linux box at around 13 h 45 min, similar to 589 and 590. I now have its resubmission (_1) in my queue. I can move it up; otherwise it will not start for approx. 100 h unless I suspend the 591s.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56430 - Posted: 22 Jun 2017, 11:20:10 UTC - in response to Message 56428.  

Not sure it is worth moving it up. Picking some 592 tasks at random, I have found 4 tasks running under Darwin that seem to have failed just before creation of the first zip, so I suspect the problem will be there on Linux as well and these may well have to be aborted. Am about to email the project.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56431 - Posted: 22 Jun 2017, 11:25:15 UTC - in response to Message 56430.  

Dave Jackson wrote: "Not sure it is worth moving it up. Picking some 592 tasks at random, I have found 4 tasks running under Darwin that seem to have failed just before creation of the first zip, so I suspect the problem will be there on Linux as well and these may well have to be aborted. Am about to email the project."


I have two 592s on another Linux box at 12%, close to the first zip (~15 h running time). Will post here on progress.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56432 - Posted: 22 Jun 2017, 11:55:44 UTC - in response to Message 56431.  

bernard_ivo wrote: "I have two 592s on another Linux box at 12%, close to the first zip (~15 h running time). Will post here on progress."


That was my thinking: there should be enough already running that we don't have to move any up to get the information we want.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,710,161
RAC: 5,793
Message 56433 - Posted: 22 Jun 2017, 14:03:32 UTC - in response to Message 56432.  
Last modified: 22 Jun 2017, 14:05:09 UTC

They both failed. Here is info on the 1st and 2nd one from batch 592. All others are suspended.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56434 - Posted: 22 Jun 2017, 14:22:36 UTC - in response to Message 56433.  

Email sent to project.

This is being looked at, but I don't think there is any real progress yet. It is a matter of taking things out and seeing what changes, and so far the results don't seem to make a lot of sense, from what I can gather.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56438 - Posted: 22 Jun 2017, 19:13:36 UTC - in response to Message 56434.  
Last modified: 22 Jun 2017, 19:56:10 UTC

Hi,

Yes, I believe that this will be affecting the Linux and probably Mac builds (but we are not sure whether it does so consistently, as some wah2 tasks run past this point on Linux). If it is just affecting Natural forcing batches, this may help us work out the fix. We are actively investigating this with some local runs and will keep you posted. As this runs OK on Windows and doesn’t necessarily affect all runs, I would rather leave the batch as is at the moment.

Best wishes,
Sarah


One question being posed is whether it is the Natural Greenhouse Gas or other forcing files that are the issue. One Darwin task from batch 592 has made it past its fourth month, so clearly it doesn't affect every task. Some Linux ones have also made it past this point. The project people and I will be following this closely, as a solution is clearly in everyone's interest.


So I am letting mine run, and have suspended work ahead of them in the queue to try to help resolve this issue as quickly as possible.