Message boards : Number crunching : Computation error Weather At Home 2 (wah2) 8.24 on WIN7 64-bit
Author | Message |
---|---|
Send message Joined: 16 Mar 16 Posts: 6 Credit: 858,545 RAC: 0 |
elapsed time: 00:00:14 on one, the other one 00:00:12. I don't really know what is going on, but the name of the Executable makes me wonder if it is a 32-bit problem: wah2_8.24_windows_intelx86.exe I would appreciate your constructive feedback/solution! I have searched in message board/forum for "computation error", but no hits. Info: plenty RAM, plenty GB on disk, fast running machine - other projects work fine. THANKS "Not everything that can be counted counts, and not everything that counts can be counted." - Albert Einstein (1879-1955) |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
There are lots of errors with batch 561 so I wouldn't worry about your computer on that one. The one from batch 560 is an unknown. The "cannot find the drive specified" error happens from time to time, is kind of random, and hard to track down. I wouldn't worry yet about your computer and this project. This project only has 32-bit applications. There doesn't seem to be any problem with Windows and the 32-bit apps. Problems of that kind sometimes occur only on Linux, where the 32-bit libraries are not always installed. |
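For anyone who wants to confirm which kind of Linux application they were sent, here is a minimal sketch. It relies only on the ELF file format, where byte 4 of the header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit. The function names are made up for illustration and are not part of BOINC or CPDN.

```python
# Sketch: report whether an ELF executable is 32-bit or 64-bit.
# Byte 4 of the ELF header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.

ELF_MAGIC = b"\x7fELF"


def elf_class(header: bytes) -> str:
    """Classify the first bytes of a file as 32-bit ELF, 64-bit ELF, or not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not ELF"
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


def elf_class_of_file(path: str) -> str:
    """Hypothetical helper: read just enough of a file to classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))


# Synthetic headers, so the example runs without any real binary present.
print(elf_class(b"\x7fELF\x01"))  # a 32-bit header
print(elf_class(b"\x7fELF\x02"))  # a 64-bit header
```

On a real machine one would call `elf_class_of_file` on the application file in the BOINC project directory. A 32-bit result on a 64-bit Linux install means the 32-bit compatibility libraries need to be present.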
Send message Joined: 16 Mar 16 Posts: 6 Credit: 858,545 RAC: 0 |
"lots of errors ..." -- sounds nice for a production project. Would probably be very helpful for novice users, if cp at n informs us of such problems ahead of time. It is "no fun" waiting for an hour or more for delivery of WU just to experience a failure after a few seconds. So, you would suggest to just wait for more WUs and to see if they run or not? GREAT - Couldn't run my company that way ... Have a nice day. "Not everything that can be counted counts, and not everything that counts can be counted." - Albert Einstein (1879-1955) |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I am currently running three 563 units (one of them is wah2_eu50r_mi90_20172_4_563_010995215_0), Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu on a Red Hat Enterprise Linux Server release 6.9 (64-bit machine) with the 32-bit compatibility libraries installed. They seem to be working fine and each has submitted a trickle. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I am currently running three 563 units (one of them is wah2_eu50r_mi90_20172_4_563_010995215_0), Weather At Home 2 (wah2) v8.25" Yeah, I'm having no trouble with the 562 and 563 tasks, just the 561 tasks. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,884,997 RAC: 4,577 |
[San-Fernando-Valley wrote:] "lots of errors ..." -- sounds nice for a production project. A collection of climate models, as run on CPDN, uses multiple parameters and initial condition inputs, not all of which lead to physically realistic climates. Those unrealistic climates are called errors in the BOINC system but represent perfectly proper topics for research. If everything about climate was already known there would be no need for research. If batches of models are released with missing files or junk parameters then you will not be the only one who is frustrated. But exploring parameter space is what this project is about, and sometimes the limits are exceeded and the models will fail. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Just had three 564 tasks (hadam3cs) fail between two and six minutes in. Two had previously failed, one with invalid theta and one with a SIGSEGV fault. The segmentation violation was on a Linux machine, the invalid theta on Windows. Waiting till the 1 hour timeout finishes and mine report to see what it says for what happened on my boxes. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
Ok, my first WU from batch 561 just failed on the Linux machine with SIGSEGV: segmentation violation. It ran for almost 8h. A WINE machine is crunching another one, and I have a few waiting on a Win7 machine. If the one on WINE fails, then I may abort the others as well. I've just moved up and started the only 564 hadam3cs I have to see if it will fail or not. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
All three 561's on my box failed with SIGSEGV: segmentation violation. Now have a 562 running, but it has previously failed with a create thread error. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I have 2 561's running on Wine, and they're still OK at 10.5 hours. But failures on Linux. I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"I think people should continue with these, as it's the only way that Oxford can see what percentage fail, and if they need more study before a re-issue." I am not going to abort anything unless someone at Oxford says I should do so. Am planning on turning my laptop over to its Linux BOINC as soon as the Wine tasks on it are finished. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
"I have 2 561's running on Wine, and they're still OK at 10.5 hours." Ok, I moved up all 561s on the win machine so they can start right after the 4x 560s it currently runs finish. There are 2 on a WINE machine, one in the queue, the other at 3.25 h. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Update on batch 561 5 out of 5 have failed on a Linux machine at about 4 hours, 2 are running on Wine at 20 hours, and 2 are running on an old Windows 7 machine at 18.5 hours. |
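A rough sketch of how counts like these turn into the "percentage fail" figure mentioned earlier in the thread. The numbers are the ones in the post above, from one contributor's machines only. The dictionary layout and function name are illustrative assumptions, not anything the project uses.

```python
# Counts taken from the post above: batch 561 on one contributor's machines.
reports = {
    "Linux": {"failed": 5, "running": 0},
    "Wine": {"failed": 0, "running": 2},
    "Windows 7": {"failed": 0, "running": 2},
}


def failure_fraction(counts):
    """Fraction of finished tasks that failed; None while nothing has finished."""
    finished = counts["failed"] + counts.get("completed", 0)
    return counts["failed"] / finished if finished else None


for platform, counts in reports.items():
    frac = failure_fraction(counts)
    if frac is None:
        status = "no finished tasks yet"
    else:
        status = f"{frac:.0%} of finished tasks failed"
    print(f"{platform}: {status}")
```

Tasks that are still running are left out of the denominator, since they could still go either way.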
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My three 563 work units completed successfully, though no credit has been issued yet (not a problem). Currently running a 505 and two 560 work units. They each have over 12 hours on them. The 505 has about 307 hours to go. The 560s have about 181 hours to go. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"I have 2 561's running on Wine, and they're still OK at 10.5 hours." Seems like a good tactic -- if they fail, they fail soon and the submitters get feedback faster. Doing that on my one Linux machine that still has 361's in process. For me, all (or almost all) of the 561's on Linux have failed in less than a day with SIGSEGV and no more diagnostics. On Wine emulating Win7 (or mostly Win10), and on a virtual and a real Win10, the 561's don't seem to fail much. Any clues, anyone? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It may be to do with the compile, and what options were selected. Oxford is aware of the problems, and when the Easter weekend is over, I'll send another email with my observations. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Too bad. Name: wah2_eas50_n58q_201412_10_505_010860235_2; Workunit: 10860235; Created: 15 Apr 2017, 13:33:08 UTC; Sent: 15 Apr 2017, 13:39:24 UTC; Report deadline: 28 Mar 2018, 18:59:24 UTC; Received: 16 Apr 2017, 6:34:50 UTC; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 0 (0x0); Computer ID: 1256552; Run time: 14 hours 30 min 44 sec; CPU time: 13 hours 28 min 30 sec; Validate state: Invalid; Credit: 0.00; stderr out: <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... SIGSEGV: segmentation violation |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Les is speculating on another thread whether the sigsegv problem under Linux is actually the same as the invalid theta on Windows, - the result of the model straying off into an impossible climate. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Les is speculating on another thread whether the sigsegv problem under Linux is actually the same as the invalid theta on Windows, - the result of the model straying off into an impossible climate." Even if the model fails (with "invalid theta"?), it should not cause a segmentation fault. About the only thing that would cause a segmentation fault is dereferencing an undefined pointer, or going out of range of a subscript, and neither should ever happen in a correctly written program. I believe the ClimatePrediction programs are written in FORTRAN, so there will be no pointers. If a program gets bad data, it should not fail with a segmentation fault. And programs like this are very hard to debug, though not as bad as in the old days.
Once I had a FORTRAN program that crashed the FORTRAN compiler so badly that it crashed the OS. (It turns out the FORTRAN program I was trying to compile had no errors.) Easy to do in those days because there was no memory management unit in the computer. I quickly found the problem: the compiler was overwriting the bottom 64 addresses of physical memory, so the interrupt vectors disappeared. So no more IO. Everything stopped. But how to find where in the compiler this was happening? I wrote an interpreter program that took the binary file that was the compiler and ran it in a simulated computer just like the real one, but it had a memory management unit. It ran about 20x slower than the real machine, but that was OK. It stopped as soon as the compiler hit the instruction that wrote on the interrupt vectors. Then it took only a few minutes to fix.
These days, the OS can (and does) stop the program that tries to write outside of its (virtual) address space, and it can point to just where the program was at the time. Here are the rest of the error messages.
stderr out
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
SIGSEGV: segmentation violation
Stack trace (13 frames):
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x352d26]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1]
Exiting...
People with the source code and a debugger should be able to find the instruction causing the segmentation fault. |
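As a starting point for that, here is a minimal sketch that pulls the frame addresses out of a trace like the one above, so they can be passed to a symbol lookup tool such as binutils `addr2line` (its `-f` and `-e` options print function names and select the executable). It assumes the `module(symbol+offset)[address]` line layout seen in this post. `TRACE` is only an excerpt of the 13 frames, and the function names are illustrative. Whether the lookup gives useful results depends on having a build with debug symbols, which contributors probably do not have.

```python
import re

# Excerpt of the stack trace posted above (not all 13 frames).
APP = "/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu"
TRACE = f"""\
{APP}(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
{APP}[0x81443f4]
{APP}[0x814b133]
/lib/libc.so.6(__libc_start_main+0xe6)[0x352d26]
{APP}[0x804c7a1]
"""

# module path (may be empty), optional "(symbol+offset)", then "[0xADDRESS]"
FRAME = re.compile(
    r"^(?P<module>[^\s(\[]*)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return (module, symbol, address) tuples for every recognisable frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((match.group("module"), match.group("symbol"), match.group("addr")))
    return frames


def unresolved_addresses(frames, module):
    """Addresses inside `module` that the trace did not already label with a symbol."""
    return [addr for mod, symbol, addr in frames if mod == module and symbol is None]


frames = parse_frames(TRACE)
addresses = unresolved_addresses(frames, APP)
# Print a command line that could be run by someone who has a symbol-bearing build.
print("addr2line -f -e", APP, *addresses)
```

The frame just below `boinc_catch_signal` and the bare `[0x55555400]` entry (0x81443f4 in this trace) is the likeliest place to start looking, since the signal handler sits on top of whatever faulted.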
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The big calculating program is in FORTRAN, but the parts that the project people can change are in some version of C. This is the part that feeds the main program with all of the data from a large number of descriptions and data files. (Triffid, Moses II, and possibly more.) Unless the data files feed Triffid and Moses. It's all horribly complicated, and not something that I want to get into. Probably not most of us. One question to ask is: what is the message issued in Linux, and on a Mac, for the equivalent of INVALID THETA in Windows? |
©2024 cpdn.org