Message boards : Number crunching : Batch 777 safr50
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
9 hours in, and running OK. Big zips - just under 90 Megs. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've lost one, just before the 1st zip, which is probably par for the course. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I noticed yesterday that several of this batch had failed on a new machine I've been monitoring more closely. Looking further over recent workunit failures of this batch on my machines, it seems that at about the halfway point they fail with signal 11. This happens on Intel and AMD, on Windows 10 in both virtual and real machines, and with Wine on Ubuntu bionic and Debian stretch. Probably less than a third of the WUs fail like this; the sample size is too small for a better estimate. Has anybody else noticed this? I expect this is something the batch creators will figure out. Edit: The majority of these rather short workunits seem to complete and upload OK. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Apparently there were problems with some of the WUs early on, but the later ones were corrected, and there is a reasonable percentage succeeding. So far, I've had 6 OK, and 2 failed. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
2 failed after 9th trickle. 3 more in progress with crossed fingers. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
While I've had about a dozen succeed, I've had 4 fail with signal 11 errors across 3 PCs after the 8th or 9th trickle. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've just had 2 more fail at the 8th zip with signal 11. The 4 on "this" computer have just gone past zip 12. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I got 2 that failed prior to the 11th zip. Both WUs were created on 13 Dec. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,005,674 RAC: 21,647 |
And here, just the one retread which should start later today. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
A safr50 batch 781 just failed on me after 2 days 18 hours, so maybe it is not just batch 777? https://www.cpdn.org/cpdnboinc/result.php?resultid=21448556 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've had a lot of 777, and a few 779 fail, all at the zip 8-9 area. I emailed the project about it earlier today. One computer was left empty, and the replacements were a 780, and 3 781s. This "Signal 11 received: Segment violation" at the zip 8-9 point is getting suspicious. But I've also had a few get past there. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I have a failed 16_780 also with "Signal 11 received: Segment violation" after zip 9. I have 3 more running from 780 and 781 and I expect them to fail ;) |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
This machine failed on one 781 after the 8th trickle, but has gone past 4 days and 13 trickles on the others ( 1-777, 2-780, 2-781). So it is not much worse than normal, just a bit strange to be so repeatable. https://www.cpdn.org/cpdnboinc/results.php?hostid=1466534 |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
In case anyone wants to know, the following are the Linux (and perhaps also UNIX) signal numbers. In particular, signal 11 is a segmentation violation: an attempt to reference a memory location that is not in the address space of the process. Unless this is due to a hardware failure, it is due to a software error, usually in the process getting the segmentation fault.

The usual ways to get this are dereferencing a pointer that has no value or an incorrect value, using an array element that is undefined (lower than the lowest subscript or greater than the highest subscript), and so on. Another way is to allocate additional memory and use it, then free that memory, and then try to use it again; this is actually the same as dereferencing an invalid pointer. In any case, all of these require fixing the bug in the program. Doing this is usually quite easy in most versions of Linux: set the system to give a core dump of the process, and examine just where the fault occurred. The stack trace can help, but usually the gdb debugger can help more.

    /* Signals. */
    #define SIGHUP 1 /* Hangup (POSIX). */
    #define SIGINT 2 /* Interrupt (ANSI). */
    #define SIGQUIT 3 /* Quit (POSIX). */
    #define SIGILL 4 /* Illegal instruction (ANSI). */
    #define SIGTRAP 5 /* Trace trap (POSIX). */
    #define SIGABRT 6 /* Abort (ANSI). */
    #define SIGIOT 6 /* IOT trap (4.2 BSD). */
    #define SIGBUS 7 /* BUS error (4.2 BSD). */
    #define SIGFPE 8 /* Floating-point exception (ANSI). */
    #define SIGKILL 9 /* Kill, unblockable (POSIX). */
    #define SIGUSR1 10 /* User-defined signal 1 (POSIX). */
    #define SIGSEGV 11 /* Segmentation violation (ANSI). */
    #define SIGUSR2 12 /* User-defined signal 2 (POSIX). */
    #define SIGPIPE 13 /* Broken pipe (POSIX). */
    #define SIGALRM 14 /* Alarm clock (POSIX). */
    #define SIGTERM 15 /* Termination (ANSI). */
    #define SIGSTKFLT 16 /* Stack fault. */
    #define SIGCLD SIGCHLD /* Same as SIGCHLD (System V). */
    #define SIGCHLD 17 /* Child status has changed (POSIX). */
    #define SIGCONT 18 /* Continue (POSIX). */
    #define SIGSTOP 19 /* Stop, unblockable (POSIX). */
    #define SIGTSTP 20 /* Keyboard stop (POSIX). */
    #define SIGTTIN 21 /* Background read from tty (POSIX). */
    #define SIGTTOU 22 /* Background write to tty (POSIX). */
    #define SIGURG 23 /* Urgent condition on socket (4.2 BSD). */
    #define SIGXCPU 24 /* CPU limit exceeded (4.2 BSD). */
    #define SIGXFSZ 25 /* File size limit exceeded (4.2 BSD). */
    #define SIGVTALRM 26 /* Virtual alarm clock (4.2 BSD). */
    #define SIGPROF 27 /* Profiling alarm clock (4.2 BSD). */
    #define SIGWINCH 28 /* Window size change (4.3 BSD, Sun). */
    #define SIGPOLL SIGIO /* Pollable event occurred (System V). */
    #define SIGIO 29 /* I/O now possible (4.2 BSD). */
    #define SIGPWR 30 /* Power failure restart (System V). */
    #define SIGSYS 31 /* Bad system call. */
    #define SIGUNUSED 31 |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"This machine failed on one 781 after the 8th trickle, but has gone past 4 days and 13 trickles on the others ( 1-777, 2-780, 2-781)." Both 781s failed after the 13th trickle. So it must be using quantum computing techniques, since it fails when you look at them. Very advanced. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
My last 2 (batch 781's) on the Ivy Bridge got REPLANCA'd just before the 14th zip, so that's it for that machine. Shut down now because of the heatwave. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Shut down now because of the heatwave." I just looked at the weather. That is awful. I am building a Ryzen 5 2600 and will do it for you. It is 10 C here. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The last 2 batch 781s on the Haswell got REPLANCA'd, but the batch 780 finished OK. So, if these get through the Segment violation at zip 9-10, they have a chance of finishing. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Thanks Jim. :) Still bad here, and nature looks like putting on a light show of its own for NYE. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
One 781 finished at 100% with Computation error: "Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xadae.pipe_dummy Leaving CPDN_ain::Monitor... 20:49:31 (30): called boinc_finish(0)" .... the last for 2018 :) |
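For anyone who wants to check the signal numbers quoted earlier in the thread on their own machine, here is a minimal sketch using Python's standard `signal` module. The function names `describe_signal` and `describe_returncode` are made up for this illustration, and the "negative return code means killed by a signal" convention is the one used by Python's `subprocess` module on POSIX systems; how the BOINC client or the CPDN model reports a crash is not covered here.

```python
import signal


def describe_signal(number):
    """Return (name, description) for a signal number on this platform.

    Hypothetical helper for illustration; unknown numbers give ("UNKNOWN", None).
    """
    try:
        name = signal.Signals(number).name
    except ValueError:
        return ("UNKNOWN", None)
    # strsignal() gives the platform's own text for the signal (Python 3.8+).
    return (name, signal.strsignal(number))


def describe_returncode(returncode):
    """Interpret a subprocess-style return code.

    Assumption: a child killed by signal N is reported as returncode -N,
    which is what Python's subprocess module does on POSIX systems.
    """
    if returncode < 0:
        name, text = describe_signal(-returncode)
        return f"killed by signal {-returncode} ({name}: {text})"
    return f"exited normally with status {returncode}"


if __name__ == "__main__":
    # Look the number up from the platform instead of hard-coding 11.
    print(describe_signal(signal.SIGSEGV))
    print(describe_returncode(-int(signal.SIGSEGV)))
    print(describe_returncode(0))
```

The sketch asks the platform for the value of `SIGSEGV` rather than assuming 11, since signal numbers above the first few can differ between operating systems.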