Message boards : Number crunching : New work Discussion
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
One failed with this; I have no idea how this happened.

Yes, looking at failed tasks I have seen a number with the process creation failure. I have never experienced it myself. I suspect it happens on machines where the user has messed around with them and it is a permissions issue, but in the absence of a user with that error posting in the forums....
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
One failed with this; I have no idea how this happened.

Yes: it seems to me the best way to get that error would be to remove the file after the task was downloaded (so the client would put it in the ready-to-start list) but before it actually was started. I guess it would also suffice to change its permissions to non-executable.
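For what it's worth, here is a minimal sketch of that failure mode. This is hypothetical illustration, not BOINC client source; the path is a placeholder and the check is simplified. The point is just that if the binary is deleted or loses its execute bit between download and start, the exec step fails before the task ever runs:

```c
/* Hypothetical sketch, NOT BOINC source: how a missing or
 * non-executable app binary becomes a process-creation failure. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path; the real project/slot layout differs. */
    const char *app = "projects/climateprediction.net/hadcm3s_app";

    /* ENOENT if the file was removed, EACCES if the execute bit
     * was stripped after download but before the task started. */
    if (access(app, X_OK) != 0) {
        fprintf(stderr, "cannot start %s: %s\n", app, strerror(errno));
        return 1;
    }

    execl(app, app, (char *)NULL);  /* returns only on failure */
    fprintf(stderr, "execl failed: %s\n", strerror(errno));
    return 1;
}
```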
Joined: 12 Apr 21 Posts: 317 Credit: 14,850,897 RAC: 19,923
For me, HadCM3s are crashing with NAMELIST errors on an i7-4790 under WSL2 Ubuntu 20.04, but HadAM4s are running fine. So I tried HadCM3s on Hyper-V Ubuntu 20.04 on a Ryzen 5900X, and so far they are working fine.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
My first one completed and uploaded successfully.

Just had one fail with a segmentation violation, but the previous computer it failed on was missing libs. Another that failed with a segmentation violation 40 seconds in on its first computer has made it to 18 minutes here, so I suspect it is OK. That tells me it is not just the computer at fault, as mine is running most of these tasks fine.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
From Sarah, in response to my observations:

Yes, I think it is more likely that these are from perturbed physics runs, so there are some parameter combinations/restarts that are not good (hence the seg fault) but others are OK. We don't know which are the duff ones without running through this batch, but they would effectively be filtered out in any continuation batches (hopefully!)
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I think that this is more likely that these are from perturbed physics runs so there are some parameter combinations/restarts that are not good (hence seg fault)

I think mine all quit within a very few seconds with a segmentation fault. I cannot imagine they got any significant computing done, surely not enough that they crashed from bad parameters -- though I could be wrong, I suppose.

Computer ID 1511241
Run time 36 sec
CPU time 2 sec

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
SIGSEGV: segmentation violation
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258
FWIW, my iMac's just completed and reported two of the HadCM3s, which it received three days ago. It picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly. As I write, both tasks have been running on my machine for 26 hours and have returned four and three trickles respectively. I've no idea why my iMac should be more successful with these than other machines. One thought is that when I joined this project nearly twelve years ago, I followed a tip that was linked to in the help pages at the time for quadrupling a Mac's shared memory allocation. But I don't know if it's relevant. NG
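If anyone wants to check whether that old tweak is still in effect, here is a small sketch, my own illustration rather than anything from the old help pages, assuming the tip adjusted kern.sysv.shmmax (the usual SysV shared-memory knob on macOS); it reads the current limit via the standard sysctlbyname() call:

```c
/* Illustrative only (macOS/BSD): print the SysV shared-memory ceiling
 * that the old "quadruple your shared memory" tip is assumed to raise. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    long long shmmax = 0;            /* kern.sysv.shmmax is a 64-bit value */
    size_t len = sizeof shmmax;

    if (sysctlbyname("kern.sysv.shmmax", &shmmax, &len, NULL, 0) == 0)
        printf("kern.sysv.shmmax = %lld bytes\n", shmmax);
    else
        perror("sysctlbyname");
    return 0;
}
```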
Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361
Two more data points, both from the same machine. task 22185585 task 22185426 CM3 short tasks, failed with SIGSEGV and 'too many model crashes'. The first had failed twice before, on machines with missing libraries; the second also failed with 'too many model crashes', but without the accompanying SIGSEGV reports. My machine is Linux Mint 20.2, Intel CPU, 16 GB memory. I updated the OS with the latest patches (including a kernel update) before running these tasks: everything else started fine, and two AM4 tasks are now running just fine (as they usually do). Machine is available for any further analysis that may be wanted.
Joined: 28 Oct 17 Posts: 1 Credit: 1,390,220 RAC: 0
Recently moved to Linux after a PC upgrade and finished setting up BOINC earlier today. So far, all 5 of the HadCM3 tasks I got have failed via SIGSEGV after 37 s runtime / 2 s CPU time, in the same fashion as Jean-David Beyer's. Meanwhile, 4 HadAM4 WUs are running fine. The tasks in question are: 22185746, 22184615, 22183973, 22184912, 22181490. I'm using an x64 install of Debian-based MX Linux on a Ryzen 5600X with 32 GB of RAM. The latest updates and 32-bit libraries are installed.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
Another three completed successfully here.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
Ubuntu 21.10 and BOINC 7.19.0 (the odd number after the 7. indicates a pre-release version I compiled from source.)

CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.11
Memory: 62.4 GB
Cache: 16896 KB

7.16.11 is the latest version for my Linux distribution. My machine is having no trouble with hadam4h work units, and has completed HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu and HadCM3 short v8.36 i686-pc-linux-gnu units in the past.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
I am keeping an eye on this, but I suspect we won't find anything significant about the machines that work and those that don't. It may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines.
Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228
An oddity: after 17 consecutive failures with SIGSEGV errors, I now have a CM3 task running fine and producing trickles. Task https://www.cpdn.org/result.php?resultid=22185636
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I am keeping an eye on this, but I suspect we won't find anything significant about the machines that work and those that don't. It may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines.

1.) Just how much computing is actually accomplished in the first two seconds of a work unit? Does it even do more than initialize things?

2.) No matter what, nothing justifies a segmentation violation, even if a computation does something that violates physical-reality constraints, such as dividing by zero. An error exit, yes, but a segmentation violation, no. Even bad programs should not do this. The best way to get a segmentation violation is to dereference a pointer to which no value has been assigned, or to use a subscript into an array that is off the end of the array. Both are indications of a defective program.
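To make those two failure modes concrete, here is a minimal C sketch (purely illustrative, nothing to do with the model code); either marked line on its own is enough to draw a SIGSEGV on a typical Linux box:

```c
/* Demonstration of the two classic SIGSEGV recipes described above.
 * This program is deliberately defective and will crash when run. */
int main(void)
{
    int *p = 0;           /* pointer to which no valid value was assigned */
    int a[10];

    *p = 42;              /* (1) dereferencing the unset pointer: SIGSEGV */
    a[10000000] = 42;     /* (2) subscript far off the end of the array */

    return 0;
}
```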
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
An oddity: after 17 consecutive failures with SIGSEGV errors, I now have a CM3 task running fine and producing trickles.

I think I have received six of the suspect work units; all have failed. I have not received any more of them. I have received N216 work units, and they all work, or have at least one day of computing accomplished. One has 5 1/2 days accomplished so far. They take my machine about 8 days to run one, though some finish in about 6 days.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
Six fails today with segmentation violations, and 5 successful completions.

The best way to get a segmentation violation is to dereference a pointer to which no value has been assigned, or to use a subscript into an array that is off the end of the array. Both are indications of a defective program.

As has been written elsewhere on these fora (I know it isn't the correct plural, but it should be!), the actual programs are from the Met Office, and the CPDN licence to run them doesn't allow taking them apart and rewriting bits; rather, they are used to crunch the data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the source code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work.

Edit: only 12 successful completions showing for the batch so far, when I have had seven against my six failures, is a high enough failure rate to merit further investigation. (There were no failures in the ones that went to the testing site, but there may just not have been enough of them to be statistically significant.)
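As a hedged illustration of how bad initial values could do that without anyone touching the Met Office code (the names and numbers below are invented, not from the model): a perturbed input parameter can end up feeding an array subscript, so a duff parameter value walks the index off the end of an array and the unmodified program segfaults.

```c
/* Invented example: a perturbed-physics parameter driving an array
 * index. Sane inputs stay in bounds; a duff one causes exactly the
 * out-of-bounds access discussed above. Not actual HadCM3 code. */
#include <stdio.h>

#define NLEVELS 19                  /* illustrative number of model levels */

static double field[NLEVELS];

static double sample(double param)  /* 'param' stands in for a perturbed value */
{
    int k = (int)(param * NLEVELS); /* expected range of param: [0, 1) */
    return field[k];                /* param far outside that range -> likely SIGSEGV */
}

int main(void)
{
    printf("%f\n", sample(0.5));    /* in range: fine */
    printf("%f\n", sample(4.0e6));  /* duff parameter: wild subscript */
    return 0;
}
```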
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
As has been written elsewhere on these fora (I know it isn't the correct plural, but it should be!), the actual programs are from the Met Office, and the CPDN licence to run them doesn't allow taking them apart and rewriting bits; rather, they are used to crunch the data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the source code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work.

I did not mean to imply (if I did) that the problems I attribute to bad code in the applications were the fault of the ClimatePrediction project and should be fixed by them.

Back when I was working on an optimizer for the Bell Labs C-compilation system, it often turned out that optimized code gave radically different results, including segmentation faults, from unoptimized code. Naturally, those who wrote the normal compilation suite blamed these differences on the optimizer. In every case, we could show that bad pointers or array subscripts were the cause, and that the different results arose because the unoptimized code failed differently. For example, the optimizer overloaded data registers; i.e., if a register's contents were not used afterwards, we could put a hot variable in it. If the program later used the contents of that register without putting anything in it, the optimized version had one value left in it and the unoptimized version had another, so of course the results were different. It should be sufficient for an optimizer to give the same results for correct programs; it should not be required to give the same results for incorrect programs.
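A minimal reconstruction of that effect (my own toy example, not the actual Bell Labs case): reading a variable that was never assigned is undefined behaviour, so the unoptimized build may print whatever happens to be in a stack slot while the optimized build prints whatever the register allocator left behind, and neither result is "the" right one.

```c
/* Toy illustration of the optimized-vs-unoptimized divergence described
 * above. 'x' is never assigned, so the program is defective: compiled
 * with -O0 it typically returns a stale stack value; with -O2 it may
 * return a leftover register value or an arbitrary constant. */
#include <stdio.h>

static int never_assigned(void)
{
    int x;          /* no value is ever stored here */
    return x;       /* undefined behaviour: result depends on the build */
}

int main(void)
{
    printf("%d\n", never_assigned());
    return 0;
}
```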
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
Thank you, that is helpful to my understanding as one who has only dabbled in programming, and that a long time ago (think Algol!). The fact that the code goes through different compilers depending on whether it is built for Linux or Mac (and, at one time, Windows) probably explains why there have been batches in the past which produced these faults on only one of the three platforms the tasks went out on.
Joined: 15 May 09 Posts: 4540 Credit: 19,025,554 RAC: 20,468
Another variable ruled out: one lot that failed all ran while I had 8 threads going. I tried reducing this to three just as two new ones were starting, and they both failed anyway.
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
The problem with batch 926 is that there are some bad data sets in among the good ones. They can either all be killed off, or they can all be left to run, which will quickly remove the bad ones. The second option is being used.