Message boards : Number crunching : Replanca Error/Sigseg fault.
Author | Message |
---|---|
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
Two wah2_sas50_... _8_590s exited with SIGSEGV: segmentation violation after almost 14h of crunching. Their running times differed by less than 10 mins. This one is now being crunched under Apple Darwin and the second one on Windows machines. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
My understanding is that not all of this batch is believed to fail under Linux and/or Darwin. I have suspended some of my other tasks, so I now have three running to see what happens. Edit: This one has got past the fourth month on Darwin, so unless the problem is only on Linux this time, with luck at least some will finish on non-Windows machines. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Did something funny when posting the link; it should be https://www.cpdn.org/cpdnboinc/result.php?resultid=20502269 |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Been digging around a bit. Could not find any Linux boxes which have returned zips. There is a high failure rate with Darwin, though I did find one that has returned 4 zips. There is also a higher than usual failure rate with Windows on those I looked at, though quite a few have returned four or five zips. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
So far, on Linux, all my 589 tasks have run fine through several months. The two 590 tasks crashed after 1 month, on January 1st, as the regional part of the model started running. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
Got two more from batch 590 that failed on my Linux box after 13h 52-54 mins; it looks like they failed close to the same spot as the other two. I have a few more queued (suspended for now) and two running. Waiting for advice on whether I should abort them. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Please abort any 590 tasks on Linux. I am awaiting clarification with regards to other platforms. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
Thanks Dave, all 590s under Linux aborted. Two 591s have started; I hope they will be OK. The 589s are at 28%. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"Yes that batch was released as wah2_ri incorrectly, but will be the same app code that it is running as we have recently made the wah2 and wah2_ri code the same. There was an error with the template though and these will fail at final upload (so feel free to abort workunits from batches 590 and 589)." Looks like 589 should be aborted as well. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
In fact everyone should abort 589 and 590. They have been replaced with 591 and 592 which are out there to be picked up. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
"In fact everyone should abort 589 and 590. They have been replaced with 591 and 592 which are out there to be picked up." Does it refer to ALL operating systems? |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"Does it refer to ALL operating systems?" Yes, because of an incorrect header they will fail at the end even if they get that far. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
I presume this is related to the tasks being listed as Weather At Home 2 (wah2) (region independent) v8.25 i686-pc-linux-gnu, despite showing as the "ordinary" WAH2 when they were in the tasks ready to send table on the server status page. One of us should really have spotted that sooner. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
This Linux WU from batch 592 has failed on someone else's Linux box at around 13h 45 mins, similar to 589 & 590. I now have its resubmission (_1) in my queue. I can move it up; otherwise it will start in approx. 100h, unless I suspend the 591s. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Not sure it is worth moving it up. Picking some 592 tasks at random, I have found 4 tasks running under Darwin that seem to have failed just before creation of the first zip, so I suspect the problem will be there on Linux as well and these may well have to be aborted. Am about to email the project. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,709,333 RAC: 5,769 |
"Not sure it is worth moving it up. Picking some 592 tasks at random, I have found 4 tasks running under Darwin that seem to have failed just before creation of the first zip, so I suspect the problem will be there on Linux as well and these may well have to be aborted. Am about to email the project." I have two 592s on another Linux box at 12%, close to the first zip at ~15h running time. Will post here on progress. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"I have two 592s on another Linux box at 12%, close to the first zip at ~15h running time. Will post here on progress." That was my thinking: there would be enough already running not to have to move any up to get the information we want. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Email sent to the project. This is being looked at, but I don't think there is any real progress yet. It is a matter of taking things out and seeing what changes, and so far the results don't seem to make a lot of sense from what I can gather. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Hi, one question being posed is whether it is the Natural Greenhouse Gas or other forcing files that are the issue. One Darwin task from batch 592 has made it past its fourth month, so clearly it doesn't affect every task. Some Linux ones have also made it past this point. The project people and I will be following this closely, as a solution is clearly in everyone's interest. So I am letting mine run, and have suspended work ahead of them in the queue to try to help resolve this issue as quickly as possible. |
©2024 cpdn.org