climateprediction.net (CPDN) home page
Thread 'Computation error Weather At Home 2 (wah2) 8.24 on WIN7 64-bit'

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 6
Credit: 858,545
RAC: 0
Message 56068 - Posted: 14 Apr 2017, 10:04:11 UTC
Last modified: 14 Apr 2017, 11:00:43 UTC

elapsed time: 00:00:14 on one, the other one 00:00:12.
I don't really know what is going on, but the name of the Executable makes me wonder if it is a 32-bit problem:
wah2_8.24_windows_intelx86.exe
I would appreciate your constructive feedback/solution!
I have searched in message board/forum for "computation error", but no hits.
Info: plenty RAM, plenty GB on disk, fast running machine - other projects work fine.
THANKS
"Not everything that can be counted counts, and not everything that counts can be counted."
- Albert Einstein (1879-1955)
ID: 56068
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56069 - Posted: 14 Apr 2017, 11:22:11 UTC

There are lots of errors with batch 561 so I wouldn't worry about your computer on that one. The one from batch 560 is an unknown. The "cannot find the drive specified" error happens from time to time, is kind of random, and hard to track down. I wouldn't worry yet about your computer and this project.

This project only has 32-bit applications. There doesn't seem to be any problem with Windows and the 32-bit apps. Problems sometimes occur only on Linux, where the 32-bit libraries are not always installed on the computer.
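
For anyone who wants to check what kind of executable a file actually is, here is a minimal Python sketch. It is not from the project, and the function names are made up for illustration. It relies only on the published ELF and PE header layouts: byte 4 of an ELF file is 1 for 32-bit and 2 for 64-bit, and a PE file stores its machine type (0x014c for i386, 0x8664 for x86-64) right after the "PE\0\0" signature.

```python
import struct


def executable_bitness(header: bytes) -> str:
    """Guess 32-bit vs 64-bit from the leading bytes of an executable."""
    # ELF (Linux): magic 0x7f 'E' 'L' 'F', then the class byte at offset 4.
    if header[:4] == b"\x7fELF" and len(header) > 4:
        return {1: "32-bit ELF", 2: "64-bit ELF"}.get(header[4], "unknown ELF")
    # PE (Windows): 'MZ' stub; the offset of the PE header is stored at 0x3C.
    if header[:2] == b"MZ" and len(header) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        has_signature = header[pe_offset:pe_offset + 4] == b"PE\0\0"
        if has_signature and len(header) >= pe_offset + 6:
            # The 16-bit machine field follows the 4-byte signature.
            (machine,) = struct.unpack_from("<H", header, pe_offset + 4)
            names = {0x014C: "32-bit PE (i386)", 0x8664: "64-bit PE (x86-64)"}
            return names.get(machine, "unknown PE")
    return "unknown"


def bitness_of_file(path: str) -> str:
    """Read the first 4 KiB of a file and classify it (hypothetical helper)."""
    with open(path, "rb") as f:
        return executable_bitness(f.read(4096))
```

Going by its file name, an executable such as wah2_8.24_windows_intelx86.exe would be expected to come back as a 32-bit PE, which is normal on 64-bit Windows and not an error in itself.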
ID: 56069
San-Fernando-Valley
Joined: 16 Mar 16
Posts: 6
Credit: 858,545
RAC: 0
Message 56070 - Posted: 14 Apr 2017, 14:32:30 UTC

"lots of errors ..." -- sounds nice for a production project.

It would probably be very helpful for novice users if CPDN informed us of such problems ahead of time.

It is "no fun" waiting for an hour or more for delivery of WU just to experience a failure after a few seconds.

So, you would suggest to just wait for more WUs and to see if they run or not?

GREAT - Couldn't run my company that way ...

Have a nice day.
ID: 56070
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56073 - Posted: 14 Apr 2017, 15:32:36 UTC

I am currently running three 563 units (one of them is wah2_eu50r_mi90_20172_4_563_010995215_0), Weather At Home 2 (wah2) v8.25
i686-pc-linux-gnu on a Red Hat Enterprise Linux Server release 6.9 (64-bit machine) with the 32-bit compatibility libraries installed. They seem to be working fine and each has submitted a trickle.
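
As a side note, the batch number can be read straight out of a task name. The Python sketch below is a guess at the naming layout, based only on task names quoted in this thread: the third field from the end matches the batch, and the second from the end looks like the zero-padded workunit number. The function and field names are hypothetical and not anything published by CPDN.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    app: str       # leading field, e.g. "wah2"
    batch: int     # assumed: third field from the end
    workunit: int  # assumed: second field from the end, zero-padded
    suffix: int    # assumed: trailing field distinguishing re-issues


def parse_task_name(name: str) -> TaskName:
    """Split a task name on underscores and pick out the trailing fields."""
    parts = name.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected task name layout: {name!r}")
    return TaskName(
        app=parts[0],
        batch=int(parts[-3]),
        workunit=int(parts[-2]),
        suffix=int(parts[-1]),
    )


task = parse_task_name("wah2_eu50r_mi90_20172_4_563_010995215_0")
print(task.batch, task.workunit)
```

Grouping results by the batch field makes it easier to see whether failures cluster in one batch rather than on one machine.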
ID: 56073
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 56074 - Posted: 14 Apr 2017, 16:45:28 UTC - in response to Message 56073.  

[Jean-David Beyer wrote:] I am currently running three 563 units (one of them is wah2_eu50r_mi90_20172_4_563_010995215_0), Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu on a Red Hat Enterprise Linux Server release 6.9 (64-bit machine) with the 32-bit compatibility libraries installed. They seem to be working fine and each has submitted a trickle.

Yeah, I'm having no trouble with the 562 and 563 tasks, just the 561 tasks.
ID: 56074
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,884,997
RAC: 4,577
Message 56076 - Posted: 14 Apr 2017, 20:27:07 UTC - in response to Message 56070.  

[San-Fernando-Valley wrote:] "lots of errors ..." -- sounds nice for a production project.

A collection of climate models, as run on CPDN, uses multiple parameters and initial condition inputs, not all of which lead to physically realistic climates. Those unrealistic climates are called errors in the BOINC system but represent perfectly proper topics for research. If everything about climate was already known there would be no need for research.

If batches of models are released with missing files or junk parameters then you will not be the only one who is frustrated. But exploring parameter space is what this project is about, and sometimes the limits are exceeded and the models will fail.
ID: 56076
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56083 - Posted: 15 Apr 2017, 14:15:01 UTC

Just had three 564 tasks (hadam3cs) fail between two and six minutes in. Two had previously failed, one with invalid theta and one with a SIGSEGV fault. The segmentation violation was on a Linux machine, the invalid theta on Windows. Waiting until the 1-hour timeout finishes and mine report, to see what it says happened on my boxes.
ID: 56083
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,706,848
RAC: 5,644
Message 56085 - Posted: 15 Apr 2017, 14:34:27 UTC - in response to Message 56083.  

Ok my first WU from batch 561 just failed on the Linux machine with SIGSEGV: segmentation violation. It ran for almost 8h.

A WINE machine is crunching another one, and I have a few waiting on a Win7 machine. If the WINE one fails then I may abort the others as well.

I've just moved up and started the only 564 hadam3cs I have to see if it will fail or not.
ID: 56085
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56086 - Posted: 15 Apr 2017, 14:50:51 UTC
Last modified: 15 Apr 2017, 14:57:30 UTC

All three 561's on my box failed with a SIGSEGV segmentation violation. I now have a 562 running, but it has previously failed with a create thread error.
ID: 56086
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 56088 - Posted: 15 Apr 2017, 15:10:58 UTC - in response to Message 56085.  

I have 2 561's running on Wine, and they're still OK at 10.5 hours.
But failures on Linux.

I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue.
ID: 56088
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56089 - Posted: 15 Apr 2017, 15:34:39 UTC - in response to Message 56088.  

[Les Bayliss wrote:] I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue.



I am not going to abort anything unless someone at Oxford says I should do so. I am planning on switching my laptop over to its Linux BOINC as soon as the Wine tasks on it are finished.
ID: 56089
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,706,848
RAC: 5,644
Message 56091 - Posted: 15 Apr 2017, 15:48:56 UTC - in response to Message 56088.  
Last modified: 15 Apr 2017, 15:49:07 UTC

[Les Bayliss wrote:] I have 2 561's running on Wine, and they're still OK at 10.5 hours. But failures on Linux. I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue.


Ok, I moved up all 561s on the win machine so they can start right after the 4x 560s it currently runs finish. There are 2 on a WINE machine, one in the queue, the other at 3.25 h.
ID: 56091
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 56097 - Posted: 16 Apr 2017, 1:45:28 UTC
Last modified: 16 Apr 2017, 1:45:54 UTC

Update on batch 561

5 out of 5 have failed on a Linux machine at about 4 hours,
2 are running on Wine at 20 hours,
and 2 are running on an old Windows 7 machine at 18.5 hours.
ID: 56097
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56099 - Posted: 16 Apr 2017, 2:48:54 UTC - in response to Message 56073.  
Last modified: 16 Apr 2017, 2:49:14 UTC

My three 563 work units completed successfully, though no credit has been issued yet (not a problem).

Currently running a 505 and two 560 work units. They each have over 12 hours on them.

The 505 has about 307 hours to go. The 560s have about 181 hours to go.
ID: 56099
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 56101 - Posted: 16 Apr 2017, 7:07:56 UTC - in response to Message 56091.  

[Les Bayliss wrote:] I have 2 561's running on Wine, and they're still OK at 10.5 hours. But failures on Linux. I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue.

[bernard_ivo wrote:] Ok, I moved up all 561s on the win machine so they can start right after the 4x 560s it currently runs finish. There are 2 on a WINE machine, one in the queue, the other at 3.25 h.


Seems like a good tactic -- if they fail, they fail soon and the submitters get feedback faster. I'm doing that on my one Linux machine that still has 561's in process.

For me, all (or almost all) of the 561's on Linux have failed in less than a day with SIGSEGV and no further diagnostics.

On Wine emulating Win7 (or mostly Win10), and on both virtual and real Win10, the 561's don't seem to fail much.

Any clues, anyone?
ID: 56101
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 56103 - Posted: 16 Apr 2017, 8:39:26 UTC - in response to Message 56101.  

It may be to do with the compile, and what options were selected.

Oxford is aware of the problems, and when the Easter weekend is over, I'll send another email with my observations.
ID: 56103
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56108 - Posted: 16 Apr 2017, 11:32:57 UTC - in response to Message 56099.  
Last modified: 16 Apr 2017, 11:35:12 UTC

Too bad.

Name wah2_eas50_n58q_201412_10_505_010860235_2
Workunit 10860235
Created 15 Apr 2017, 13:33:08 UTC
Sent 15 Apr 2017, 13:39:24 UTC
Report deadline 28 Mar 2018, 18:59:24 UTC
Received 16 Apr 2017, 6:34:50 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1256552
Run time 14 hours 30 min 44 sec
CPU time 13 hours 28 min 30 sec
Validate state Invalid
Credit 0.00

stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
SIGSEGV: segmentation violation
ID: 56108
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56109 - Posted: 16 Apr 2017, 12:55:55 UTC - in response to Message 56108.  

Les is speculating on another thread whether the sigsegv problem under Linux is actually the same as the invalid theta on Windows, - the result of the model straying off into an impossible climate.
ID: 56109
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 56112 - Posted: 16 Apr 2017, 18:38:16 UTC - in response to Message 56109.  

[Dave Jackson wrote:] Les is speculating on another thread whether the sigsegv problem under Linux is actually the same as the invalid theta on Windows, - the result of the model straying off into an impossible climate.


Even if the model fails (with "invalid theta"?), it should not cause a segmentation fault. About the only thing that would cause a segmentation fault is dereferencing an undefined pointer, or going out of range of a subscript, and neither should ever happen in a correctly written program. I believe the ClimatePrediction programs are written in FORTRAN, so there will be no pointers.

If a program gets bad data, it should not fail with a segmentation fault. And programs like this are very hard to debug, though not as bad as in the old days.

Once I had a FORTRAN program that crashed the FORTRAN compiler so badly that it crashed the OS. (It turns out the FORTRAN program I was trying to compile had no errors.) Easy to do in those days because there was no memory management unit in the computer. I quickly found the problem: the compiler was overwriting the bottom 64 addresses of physical memory, so the interrupt vectors disappeared. So no more IO. Everything stopped. But how to find where in the compiler this was happening? I wrote an interpreter program that took the binary file that was the compiler and ran it in a simulated computer just like the real one, but it had a memory management unit. It ran about 20x slower than the real machine, but that was OK. It stopped as soon as the compiler hit the instruction that wrote on the interrupt vectors. Then it took only a few minutes to fix.

These days, the OS can (and does) stop the program that tries to write outside of its (virtual) address space, and it can point to just where the program was at the time.

Here are the rest of the error messages.
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
SIGSEGV: segmentation violation
Stack trace (13 frames):
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x352d26]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1]

Exiting...

People with the source code and a debugger should be able to find the instruction causing the segmentation fault.
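
A first step toward that is pulling the frame addresses out of the trace. The following Python sketch is only an illustration: the regular expression and the names are assumptions about this particular output format, not part of BOINC or CPDN. The addresses it collects could then be handed to a symbol lookup tool such as addr2line, provided a build with debug symbols is available.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    module: str            # path of the binary or library, "" if not printed
    symbol: Optional[str]  # e.g. "boinc_catch_signal+0x67", if printed
    address: int           # address shown in square brackets


# Matches lines of the form  path(symbol+0xoff)[0xaddr],  path[0xaddr]  or  [0xaddr]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]*)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_stack_trace(stderr_text: str) -> List[Frame]:
    """Collect the stack frames from a stderr dump, skipping other lines."""
    frames = []
    for line in stderr_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                Frame(
                    module=match.group("module"),
                    symbol=match.group("symbol") or None,
                    address=int(match.group("addr"), 16),
                )
            )
    return frames
```

Frames that print only a bare address, with no module path, come back with an empty module string, so they are easy to filter out before any symbol lookup.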
ID: 56112
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 56117 - Posted: 16 Apr 2017, 21:20:22 UTC - in response to Message 56112.  

The big calculating program is in FORTRAN, but the part that the project people can change is in some version of C.
This is the part that feeds the main program with all of the data from a large number of descriptions and data files. (Triffid, Moses II, and possibly more.)
Unless the data files feed Triffid and Moses.

It's all horribly complicated, and not something that I want to get into. Probably not most of us.

One question to ask is: what is the message issued on Linux, and on a Mac, for the equivalent of INVALID THETA on Windows?
ID: 56117