safr50 segmentation errors
Author | Message |
---|---|
Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Hi there, at least 5 of my safr50 WUs from batches 816, 817, 818 and 820 were killed with "Signal 11 received: Segment violation" on Win and Win 10 machines. I also noticed that earlier safr50 batches (777, 789, 790) report the same error. Could there be something wrong with this particular model? Can we enable some event log options that would provide more info? Cheers |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
I’ve had an 819 complete, which isn’t in your list. The upload Zips are very large - 90 MB - so maybe a heavily loaded machine might crash the model when generating those. In that case reducing the number of CPUs might help. Another thing about those SAFR models is that the GFLOPs estimate is a significant overestimate - so the models run much more quickly than BOINC Manager suggests. |
Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946 |
"The upload Zips are very large - 90 MB - so maybe a heavily loaded machine might crash the model when generating those. In that case reducing the number of CPUs might help." Possibly, but we have had batches with zips over 100 MB and I don't remember any of them having so many failures. (Or perhaps I am just getting old and have forgotten.) And when I have looked at some of the failures on these batches, machines with only 2 cores seem just as prone to the problem. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Now that you mention it, I have had 5 safr50s fail and 6 complete OK thus far. Completed: three 820, one 819 and two 818. Failed: two 816, two 817 and one 820. The 820 was Signal 11. One 817 was Signal 11, and the other "Model crashed". Both of the 816 were Signal 11. I have 10 Mbps upload (and just checked it), and normally don't have problems there. So make of it what you will. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
Two 817s have now failed for me. They were run singly and the crashes were not associated with Zip file generation, so my earlier speculation is clearly wrong ... |
Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I have now had 9 WUs fail, and my rough estimate is a failure rate of around 30-40%; all reported Signal 11. Since pagination does not really work, I'm not going to look up the batch number of each WU, but they are all safr50. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
In case anyone cares, here is what the signal numbers mean. These are the definitions I know of for Linux machines. I would be surprised if they were different on other machines, though some may be absent.

    /* Signals. */
    #define SIGHUP     1        /* Hangup (POSIX). */
    #define SIGINT     2        /* Interrupt (ANSI). */
    #define SIGQUIT    3        /* Quit (POSIX). */
    #define SIGILL     4        /* Illegal instruction (ANSI). */
    #define SIGTRAP    5        /* Trace trap (POSIX). */
    #define SIGABRT    6        /* Abort (ANSI). */
    #define SIGIOT     6        /* IOT trap (4.2 BSD). */
    #define SIGBUS     7        /* BUS error (4.2 BSD). */
    #define SIGFPE     8        /* Floating-point exception (ANSI). */
    #define SIGKILL    9        /* Kill, unblockable (POSIX). */
    #define SIGUSR1    10       /* User-defined signal 1 (POSIX). */
    #define SIGSEGV    11       /* Segmentation violation (ANSI). */
    #define SIGUSR2    12       /* User-defined signal 2 (POSIX). */
    #define SIGPIPE    13       /* Broken pipe (POSIX). */
    #define SIGALRM    14       /* Alarm clock (POSIX). */
    #define SIGTERM    15       /* Termination (ANSI). */
    #define SIGSTKFLT  16       /* Stack fault. */
    #define SIGCLD     SIGCHLD  /* Same as SIGCHLD (System V). */
    #define SIGCHLD    17       /* Child status has changed (POSIX). */
    #define SIGCONT    18       /* Continue (POSIX). */
    #define SIGSTOP    19       /* Stop, unblockable (POSIX). */
    #define SIGTSTP    20       /* Keyboard stop (POSIX). */
    #define SIGTTIN    21       /* Background read from tty (POSIX). */
    #define SIGTTOU    22       /* Background write to tty (POSIX). */
    #define SIGURG     23       /* Urgent condition on socket (4.2 BSD). */
    #define SIGXCPU    24       /* CPU limit exceeded (4.2 BSD). */
    #define SIGXFSZ    25       /* File size limit exceeded (4.2 BSD). */
    #define SIGVTALRM  26       /* Virtual alarm clock (4.2 BSD). */
    #define SIGPROF    27       /* Profiling alarm clock (4.2 BSD). */
    #define SIGWINCH   28       /* Window size change (4.3 BSD, Sun). */
    #define SIGPOLL    SIGIO    /* Pollable event occurred (System V). */
    #define SIGIO      29       /* I/O now possible (4.2 BSD). */
    #define SIGPWR     30       /* Power failure restart (System V). */
    #define SIGSYS     31       /* Bad system call. */
    #define SIGUNUSED  31
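For anyone who wants to turn a reported signal number back into a name without relying on a hard-coded list, here is a minimal sketch in C. It assumes only the six signal macros that ISO C requires every `<signal.h>` to provide (SIGABRT, SIGFPE, SIGILL, SIGINT, SIGSEGV, SIGTERM); the other entries in a Linux header may be missing or numbered differently on other platforms. The helper name `signal_name` is made up for this illustration and is not part of BOINC or of any library.

```c
#include <signal.h>

/* Hypothetical helper: map a signal number to its macro name.
 * Only the six signals guaranteed by ISO C are handled, and the
 * comparison uses the platform's own macros rather than literal
 * numbers, so the lookup stays correct wherever it is compiled. */
const char *signal_name(int signo)
{
    switch (signo) {
    case SIGABRT: return "SIGABRT";  /* abnormal termination */
    case SIGFPE:  return "SIGFPE";   /* erroneous arithmetic operation */
    case SIGILL:  return "SIGILL";   /* illegal instruction */
    case SIGINT:  return "SIGINT";   /* interactive interrupt */
    case SIGSEGV: return "SIGSEGV";  /* invalid memory access */
    case SIGTERM: return "SIGTERM";  /* termination request */
    default:      return "unknown";  /* not one of the six ISO C signals */
    }
}
```

Calling `signal_name(SIGSEGV)` returns "SIGSEGV", which is the segment violation reported by the failing safr50 tasks; any number outside the six handled cases comes back as "unknown".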
©2024 cpdn.org