climateprediction.net (CPDN) home page
Thread 'Batch 777 safr50'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59165 - Posted: 14 Dec 2018, 14:25:54 UTC

9 hours in, and running OK.
Big zips - just under 90 Megs.
ID: 59165
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59166 - Posted: 14 Dec 2018, 21:51:50 UTC

I've lost one, just before the 1st zip, which is probably par for the course.
ID: 59166
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 59213 - Posted: 21 Dec 2018, 5:36:41 UTC
Last modified: 21 Dec 2018, 6:13:44 UTC

I noticed yesterday that several of this batch had failed on a new machine I've been monitoring more closely.
Looking further over recent workunit failures of this batch on my machines, it seems that at about the halfway point they fail with signal 11. This happens on Intel and AMD, on Windows 10 in both virtual and real machines, and with wine on Ubuntu bionic and Debian stretch. Probably less than a third of the WUs fail like this; the sample size is too small to make a better estimate.

Anybody else notice this?
Thinking this is something the batch creators will figure out.
<edit>
The majority of these rather short workunits seem to complete and upload OK
ID: 59213
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59214 - Posted: 21 Dec 2018, 6:27:19 UTC

Apparently there were problems with some of the WUs early on, but the later ones were corrected, and there is a reasonable percentage succeeding.

So far, I've had 6 OK, and 2 failed.
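
For anyone wondering how much a tally like this can say about the true failure rate, here is a minimal sketch (not project code) of a 95% Wilson score interval for the failed fraction. The function name and the choice of the Wilson interval are assumptions made for illustration; the counts are the 2 failures out of 8 tasks mentioned above.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Return (low, high) bounds of the Wilson score interval for failures/total.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Counts taken from the post above: 6 OK and 2 failed.
    low, high = wilson_interval(failures=2, total=8)
    print(f"failure rate between {low:.0%} and {high:.0%}")
```

With only 8 tasks the interval is very wide (roughly 7% to 59%), so a handful of results per host cannot pin the rate down.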
ID: 59214
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59219 - Posted: 22 Dec 2018, 0:02:46 UTC - in response to Message 59214.  

2 failed after 9th trickle. 3 more in progress with crossed fingers.
ID: 59219
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 59221 - Posted: 22 Dec 2018, 2:50:07 UTC

While I've had about a dozen succeed, I've had 4 fail with signal 11 errors across 3 PCs after the 8th or 9th trickle.
ID: 59221
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59222 - Posted: 22 Dec 2018, 5:26:30 UTC

I've just had 2 more fail at the 8th zip with signal 11.

The 4 on "this" computer have just gone past zip 12.
ID: 59222
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 59223 - Posted: 22 Dec 2018, 8:28:44 UTC - in response to Message 59222.  

I got 2 that failed prior to the 11th zip. Both WUs were created on 13 Dec.
ID: 59223
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59224 - Posted: 22 Dec 2018, 9:11:09 UTC - in response to Message 59223.  

And here, just the one retread which should start later today.
ID: 59224
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59240 - Posted: 26 Dec 2018, 9:15:07 UTC

A safr50 batch 781 just failed on me after 2 days 18 hours, so maybe it is not just batch 777?
https://www.cpdn.org/cpdnboinc/result.php?resultid=21448556
ID: 59240
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59241 - Posted: 26 Dec 2018, 9:23:27 UTC

I've had a lot of 777, and a few 779 fail, all at the zip 8-9 area.
I emailed the project about it earlier today.

One computer was left empty, and the replacements were a 780, and 3 781s.

This "Signal 11 received: Segment violation" at the zip 8-9 point is getting suspicious.

But I've also had a few get past there.
ID: 59241
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 59251 - Posted: 27 Dec 2018, 9:02:05 UTC - in response to Message 59241.  

I have a failed 16_780 also with "Signal 11 received: Segment violation" after zip 9.

I have 3 more running from 780 and 781 and I expect them to fail ;)
ID: 59251
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59252 - Posted: 27 Dec 2018, 15:53:03 UTC - in response to Message 59241.  

This machine failed on one 781 after the 8th trickle, but has gone past 4 days and 13 trickles on the others ( 1-777, 2-780, 2-781). So it is not much worse than normal, just a bit strange to be so repeatable.

https://www.cpdn.org/cpdnboinc/results.php?hostid=1466534
ID: 59252
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 59253 - Posted: 27 Dec 2018, 16:31:43 UTC - in response to Message 59213.  

In case anyone wants to know, the following are the Linux (and perhaps also the UNIX) signal numbers. In particular, signal 11 is a segmentation violation: an attempt to reference a memory location that is not in the address space of the process. Unless this is due to a hardware failure, it is due to a software error, usually in the process getting the segmentation fault.

The usual ways to get this are dereferencing a pointer that has no value or an incorrect value, using an array element that is out of bounds (lower than the lowest subscript or greater than the highest subscript), and so on. Another way is to allocate additional memory, use it, free it, and then try to use it again; this is really the same as dereferencing an invalid pointer.

In any case, all of these require fixing the bug in the program. Finding it is usually quite easy in most versions of Linux: set the system to give a core dump of the process, and examine just where the fault occurred. The stack trace can help, but usually the gdb debugger can help more.

/* Signals. */
#define SIGHUP 1 /* Hangup (POSIX). */
#define SIGINT 2 /* Interrupt (ANSI). */
#define SIGQUIT 3 /* Quit (POSIX). */
#define SIGILL 4 /* Illegal instruction (ANSI). */
#define SIGTRAP 5 /* Trace trap (POSIX). */
#define SIGABRT 6 /* Abort (ANSI). */
#define SIGIOT 6 /* IOT trap (4.2 BSD). */
#define SIGBUS 7 /* BUS error (4.2 BSD). */
#define SIGFPE 8 /* Floating-point exception (ANSI). */
#define SIGKILL 9 /* Kill, unblockable (POSIX). */
#define SIGUSR1 10 /* User-defined signal 1 (POSIX). */
#define SIGSEGV 11 /* Segmentation violation (ANSI). */
#define SIGUSR2 12 /* User-defined signal 2 (POSIX). */
#define SIGPIPE 13 /* Broken pipe (POSIX). */
#define SIGALRM 14 /* Alarm clock (POSIX). */
#define SIGTERM 15 /* Termination (ANSI). */
#define SIGSTKFLT 16 /* Stack fault. */
#define SIGCLD SIGCHLD /* Same as SIGCHLD (System V). */
#define SIGCHLD 17 /* Child status has changed (POSIX). */
#define SIGCONT 18 /* Continue (POSIX). */
#define SIGSTOP 19 /* Stop, unblockable (POSIX). */
#define SIGTSTP 20 /* Keyboard stop (POSIX). */
#define SIGTTIN 21 /* Background read from tty (POSIX). */
#define SIGTTOU 22 /* Background write to tty (POSIX). */
#define SIGURG 23 /* Urgent condition on socket (4.2 BSD). */
#define SIGXCPU 24 /* CPU limit exceeded (4.2 BSD). */
#define SIGXFSZ 25 /* File size limit exceeded (4.2 BSD). */
#define SIGVTALRM 26 /* Virtual alarm clock (4.2 BSD). */
#define SIGPROF 27 /* Profiling alarm clock (4.2 BSD). */
#define SIGWINCH 28 /* Window size change (4.3 BSD, Sun). */
#define SIGPOLL SIGIO /* Pollable event occurred (System V). */
#define SIGIO 29 /* I/O now possible (4.2 BSD). */
#define SIGPWR 30 /* Power failure restart (System V). */
#define SIGSYS 31 /* Bad system call. */
#define SIGUNUSED 31
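
As a rough illustration of how a signal number shows up to whatever launched the process, here is a small Python sketch, assuming a POSIX system such as Linux. `describe_exit` and `run_and_report` are names made up for this example and are not part of BOINC or the CPDN application. On POSIX, Python's `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX a negative return code means "terminated by signal -returncode".
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Hypothetical stand-in for a crashing program: a child Python process
    # that sends SIGSEGV to itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    print(run_and_report(child))
```

On Linux the demo child should be reported as killed by signal 11, which matches SIGSEGV in the list above. This only shows how the signal is reported; finding the actual bug still needs the core dump and a debugger.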
ID: 59253
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59254 - Posted: 27 Dec 2018, 21:01:07 UTC - in response to Message 59252.  

This machine failed on one 781 after the 8th trickle, but has gone past 4 days and 13 trickles on the others ( 1-777, 2-780, 2-781).

Both 781s failed after the 13th trickle. So it must be using quantum computing techniques, since it fails when you look at them. Very advanced.
ID: 59254
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59256 - Posted: 29 Dec 2018, 6:00:51 UTC

My last 2 (batch 781's) on the Ivy Bridge got REPLANCA'd just before the 14th zip, so that's it for that machine.
Shut down now because of the heatwave.
ID: 59256
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59259 - Posted: 29 Dec 2018, 13:23:53 UTC - in response to Message 59256.  

Shut down now because of the heatwave.

I just looked at the weather. That is awful. I am building a Ryzen 5 2600 and will do it for you. It is 10 C here.
ID: 59259
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59266 - Posted: 30 Dec 2018, 3:54:56 UTC

The last 2 batch 781 on the Haswell got REPLANCA'd, but the batch 780 finished OK.

So, if these get through the Segment violation at zip 9-10, they have a chance at finishing.
ID: 59266
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59267 - Posted: 30 Dec 2018, 8:10:55 UTC

Thanks Jim. :)
Still bad here, and nature looks like putting on a light show of its own for NYE.
ID: 59267
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 59277 - Posted: 31 Dec 2018, 19:05:08 UTC - in response to Message 59266.  

One 781 finished at 100% with Computation error

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xadae.pipe_dummy
Leaving CPDN_ain::Monitor...
20:49:31 (30): called boinc_finish(0)
....

the last for 2018 :)
ID: 59277
