climateprediction.net (CPDN) home page
Thread 'safr50 segmentation errors'

Message boards : Number crunching : safr50 segmentation errors

bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 60262 - Posted: 11 Jun 2019, 7:53:05 UTC

Hi there,
I've had at least 5 safr50 WUs from batches 816, 817, 818 and 820 killed with "Signal 11 received: Segment violation" on Win and Win 10 machines.
I also noticed that earlier safr50 batches (777, 789, 790) report the same error.

Could there be something wrong with this particular model? Can we enable some event log options that would provide more info?
Cheers
ID: 60262
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 60263 - Posted: 11 Jun 2019, 9:11:49 UTC - in response to Message 60262.  

I’ve had an 819 complete, which isn’t in your list. The upload Zips are very large - 90 MB - so maybe a heavily loaded machine might crash the model when generating those. In that case reducing the number of CPUs might help.

Another thing about those SAFR models is that the GFLOPs estimate is a significant overestimate - so the models run much more quickly than BOINC Manager suggests.
ID: 60263
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 60267 - Posted: 11 Jun 2019, 14:51:03 UTC - in response to Message 60263.  

"The upload Zips are very large - 90 MB - so maybe a heavily loaded machine might crash the model when generating those. In that case reducing the number of CPUs might help."

Possibly, but we have had batches with zips over 100 MB and I don't remember any of them having so many failures. (Or perhaps I am just getting old and have forgotten.) And when I have looked at some of the failures on these batches, machines with only 2 cores seem just as prone to the problem.
ID: 60267
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 60268 - Posted: 11 Jun 2019, 17:55:07 UTC

Now that you mention it, I have had 5 safr50s fail and 6 complete OK thus far.

Completed: three 820, one 819 and two 818.

Failed: two 816, two 817 and one 820.
The 820 was Signal 11.
One 817 was Signal 11, and the other "Model crashed".
Both of the 816 were Signal 11.
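
As a rough check on the numbers, the counts listed above can be tallied per batch. The following is only a sketch in Python: the names `completed`, `failed` and `failure_rate` are illustrative, and 11 tasks on one host is an anecdote rather than a statistic.

```python
# Tally of the safr50 results listed above (one host, 11 tasks).
# The variable names are illustrative; the counts are the ones given in this post.
completed = {820: 3, 819: 1, 818: 2}
failed = {816: 2, 817: 2, 820: 1}


def failure_rate(completed, failed):
    """Return failed / (completed + failed), or None if there are no tasks."""
    n_failed = sum(failed.values())
    total = sum(completed.values()) + n_failed
    return n_failed / total if total else None


# Per-batch breakdown, in batch order.
for batch in sorted(set(completed) | set(failed)):
    ok, bad = completed.get(batch, 0), failed.get(batch, 0)
    print(f"batch {batch}: {bad} failed of {ok + bad}")

print(f"overall: {failure_rate(completed, failed):.0%} failed")
```

On these counts that is 5 failures out of 11 tasks, or about 45%.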

I have 10 Mbps upload (and just checked it), and normally don't have problems there.

So make of it what you will.
ID: 60268
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 60308 - Posted: 14 Jun 2019, 10:35:05 UTC

Two 817s have now failed for me. They were run singly and the crashes were not associated with Zip file generation, so my earlier speculation is clearly wrong ...
ID: 60308
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 60323 - Posted: 16 Jun 2019, 11:23:40 UTC
Last modified: 16 Jun 2019, 11:25:13 UTC

I have now had 9 WUs fail, and my rough estimate is a failure rate of around 30-40%.
All reported Signal 11. Since pagination does not really work, I'm not going to look up the batch number of each WU, but they are all safr50.
ID: 60323
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 60324 - Posted: 16 Jun 2019, 12:23:08 UTC

In case anyone cares, here is what the signal numbers mean. These are the definitions I know of for Linux machines. I would be surprised if they were different on other machines, though some may be absent.

/* Signals. */
#define SIGHUP    1       /* Hangup (POSIX). */
#define SIGINT    2       /* Interrupt (ANSI). */
#define SIGQUIT   3       /* Quit (POSIX). */
#define SIGILL    4       /* Illegal instruction (ANSI). */
#define SIGTRAP   5       /* Trace trap (POSIX). */
#define SIGABRT   6       /* Abort (ANSI). */
#define SIGIOT    6       /* IOT trap (4.2 BSD). */
#define SIGBUS    7       /* BUS error (4.2 BSD). */
#define SIGFPE    8       /* Floating-point exception (ANSI). */
#define SIGKILL   9       /* Kill, unblockable (POSIX). */
#define SIGUSR1   10      /* User-defined signal 1 (POSIX). */
#define SIGSEGV   11      /* Segmentation violation (ANSI). */
#define SIGUSR2   12      /* User-defined signal 2 (POSIX). */
#define SIGPIPE   13      /* Broken pipe (POSIX). */
#define SIGALRM   14      /* Alarm clock (POSIX). */
#define SIGTERM   15      /* Termination (ANSI). */
#define SIGSTKFLT 16      /* Stack fault. */
#define SIGCLD    SIGCHLD /* Same as SIGCHLD (System V). */
#define SIGCHLD   17      /* Child status has changed (POSIX). */
#define SIGCONT   18      /* Continue (POSIX). */
#define SIGSTOP   19      /* Stop, unblockable (POSIX). */
#define SIGTSTP   20      /* Keyboard stop (POSIX). */
#define SIGTTIN   21      /* Background read from tty (POSIX). */
#define SIGTTOU   22      /* Background write to tty (POSIX). */
#define SIGURG    23      /* Urgent condition on socket (4.2 BSD). */
#define SIGXCPU   24      /* CPU limit exceeded (4.2 BSD). */
#define SIGXFSZ   25      /* File size limit exceeded (4.2 BSD). */
#define SIGVTALRM 26      /* Virtual alarm clock (4.2 BSD). */
#define SIGPROF   27      /* Profiling alarm clock (4.2 BSD). */
#define SIGWINCH  28      /* Window size change (4.3 BSD, Sun). */
#define SIGPOLL   SIGIO   /* Pollable event occurred (System V). */
#define SIGIO     29      /* I/O now possible (4.2 BSD). */
#define SIGPWR    30      /* Power failure restart (System V). */
#define SIGSYS    31      /* Bad system call. */
#define SIGUNUSED 31
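
These values match Linux on x86. A few of them (SIGUSR1 and SIGBUS, for example) are numbered differently on some other systems, and Windows defines only a handful of signals, so it can be worth checking on the machine at hand. A minimal Python sketch using only the standard `signal` module; the helper name `describe_signal` is made up for this example:

```python
import signal


def describe_signal(number):
    """Return the local name of a signal number, or None if it is not defined here."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to any signal on this platform.
        return None


# 11 is the number reported by the failed safr50 tasks.
print(11, describe_signal(11))

# List every signal the local platform defines, in numeric order.
for sig in sorted(signal.Signals):
    print(int(sig), sig.name)
```

As far as I know, 11 is SIGSEGV on Linux, on macOS and in the Windows C runtime alike, which fits the "Segment violation" text in the failed tasks.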
ID: 60324

©2024 cpdn.org