Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Joined: 22 Feb 06 Posts: 491 Credit: 30,999,920 RAC: 14,617 |
Two got - two failed. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Thank you, that is helpful to my understanding as one who has only dabbled in programming and that a long time ago (Think Algol!) The fact that the code goes through different compilers depending on whether for Linux or Mac and at one time Windows, probably explains why there have been batches in the past which have produced these faults on only one platform out of the three these tasks went out on then. For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program. Different operating systems, too. When I was working on optimizers, the designers of the regular C-compiler suite used one version of the UNIX kernel and we used a slightly different one. Actually the same source, but we turned on a special testing option that made the hardware give an interrupt just prior to the segmentation violation. Then we diddled the compilation system to leave the bottom page of RAM unused. And we set that page to no access. Thus any attempt to access that bottom page gave us an interrupt that we could analyze. One of the programs we compiled and tested was the UNIX kernel itself, which we would run and test. We optimized the kernel as well as everything else. And we found bugs in it. They just did not care and never fixed the bugs we pointed out to them. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242 |
...we found bugs in it. They just did not care and never fixed the bugs we pointed out to them. So depressing and yet so common. |
Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
The problem with batch 926 is that there are some bad data sets in among some good data sets. That’s fair enough, if they fail they fail very quickly. |
Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
I wrote: my iMac … picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly. Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units. WU 12129231 WU 12129961 WU 12127970 NG |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units. That has been my impression too. However there does seem to be a large element of randomness about it. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,850,897 RAC: 19,923 |
Could someone please look at this failed task https://www.cpdn.org/result.php?resultid=22185093? Failed for different reason than most recently here. Would like to know what the log means and if the Run & CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242 |
Could someone please look at this failed task https://www.cpdn.org/result.php?resultid=22185093? Failed for different reason than most recently here. Would like to know what the log means and if the Run & CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out. Errors involving “invalid theta” indicate that the model’s physics has become unrealistic, so the model is stopped. For a computer that is not over-clocked such errors can be ignored. The model is likely to fail in the same way on other PCs with the same architecture, but might succeed, for example, on a Mac with a different floating-point library. It’s all part of the ensemble method of modelling: the project knows that some models are on the edge. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program. Sorry: I spelled his name wrong. It is David Gries. https://www.cs.cornell.edu/gries/ |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
That’s fair enough, if they fail they fail very quickly. They sure do. I just got two more that failed after 36 seconds of wall-clock time and just under 4 seconds of CPU time. Segmentation violations. https://www.cpdn.org/result.php?resultid=22181507 https://www.cpdn.org/result.php?resultid=22182265 |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
My last few to download have all failed. Two more downloading right now. |
Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
That’s fair enough, if they fail they fail very quickly. I’ve realised a problem with my logic - when they fail the hour delay cuts in before it can pull down the next task to try. In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine. |
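The arithmetic behind that estimate can be sketched roughly as follows. This is only an illustration: the six consecutive failures and the one-hour delay come from the post above, while the function name `idle_hours` and the `idle_slots` parameter (how many CPU slots sit waiting during each delay) are hypothetical and not anything BOINC or CPDN exposes.

```python
def idle_hours(consecutive_failures, backoff_hours=1.0, idle_slots=1):
    """Estimate processing time lost to the post-failure delay.

    Assumes (hypothetically) that every failure triggers one full delay
    of `backoff_hours` and that `idle_slots` CPU slots do no project
    work while the client waits.
    """
    if consecutive_failures < 0 or backoff_hours < 0 or idle_slots < 0:
        raise ValueError("arguments must be non-negative")
    return consecutive_failures * backoff_hours * idle_slots


# Six failures in a row, one-hour delay each, one waiting slot.
print(idle_hours(6))
```

The real figure will be somewhat higher than this lower bound, since each failed task also uses a little download and start-up time before the delay begins.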
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine. I was away for a couple of hours, and three more failed. All with segmentation error. One of mine that failed also failed on another machine running Darwin: it does not seem to like machines running Darwin. But my Linux machine is 1511241 |
Joined: 22 Feb 06 Posts: 491 Credit: 30,999,920 RAC: 14,617 |
The 5 I got over the weekend and the 2 today have all failed. My machine is Ubuntu 20.04. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
On the other side of things, I started a new one today, but it's hadam4h, batch 895. Phew. :) Hmmm. That batch is almost a year old; the last attempt on mine was abandoned after a year, without any trickles. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
On the other side of things, I started a new one today, but it's hadam4h, batch 895. I get old ones sometimes too. Some had failed several times before I got them, and for different reasons. hadam4h_e0zp_207111_5_887_012043123_3 hadam4h_h1g7_201011_5_889_012045366_3 hadam4h_b0v6_201211_5_882_012036130_1 |
Joined: 22 Feb 06 Posts: 491 Credit: 30,999,920 RAC: 14,617 |
Same here. Suspended pending 926 tasks and started the 895 one just to check if it would run OK. Looking good at the moment. There seem to be a lot of computers that are "idle" for several months. Extra grist to the mill for a shortening of the allowed/estimated completion time before reissuing the task. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Suspended pending 926 tasks and started the 895 one just to check if it would run OK. I take them as they come. My N216 tasks mostly run OK, but all 13 or so 926 tasks have failed with segmentation faults after 2 to 4 seconds of processor time. These have all been in January 2022. The same model has worked OK in the past. I have one more downloaded, but there are 4 CPDN tasks in the queue before that one. I can run up to 4 CPDN tasks at a time on this machine. I have an 8-core machine that can run 16 tasks at a time, but it does not make sense for me to run more than 8 Boinc tasks at a time. |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
A total of 107 of #926 are now showing as completed. I will have to wait till some N216 tasks complete before I can try any more. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Right now my machine is idle except for 8 Boinc tasks. Of these, three are N216 CPDN tasks and five are WCG models. Of those, two are OPN1, two are ARP1, and one is MCM1. The essentials of my machine are:

CPU type GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 16
Operating System Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 7.16.11
Memory 62.4 GB
Cache 16896 KB <---<<<

top - 02:13:32 up 16 days, 12:00, 1 user, load average: 8.08, 8.25, 8.31
Tasks: 463 total, 9 running, 453 sleeping, 1 stopped, 0 zombie
%Cpu(s): 0.3 us, 2.8 sy, 47.0 ni, 49.7 id, 0.0 wa, 0.1 hi, 0.1 si, 0.0 st
MiB Mem : 63902.2 total, 867.4 free, 9746.1 used, 53288.6 buff/cache
MiB Swap: 15992.0 total, 15043.0 free, 949.0 used. 53304.0 avail Mem

Note: with all the RAM I have, it does essentially no paging. Now let us look at the cache hit ratio. Almost half of the memory requests are found in the cache with this processor.

# perf stat -aB -e cache-references,cache-misses
Performance counter stats for 'system wide':
37,810,375,576 cache-references
20,135,725,239 cache-misses # 53.254 % of all cache refs
60.935538193 seconds time elapsed |
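For anyone wanting to check the "almost half" figure, the hit ratio is simply the complement of the miss percentage that perf prints. A minimal sketch, assuming the two counters are read straight from the output quoted in the post (the function name `cache_ratios` is made up for this illustration and is not part of perf):

```python
def cache_ratios(references, misses):
    """Return (miss_percent, hit_percent) from two perf counters."""
    if references <= 0:
        raise ValueError("references must be positive")
    if not 0 <= misses <= references:
        raise ValueError("misses must lie between 0 and references")
    miss_percent = 100.0 * misses / references
    return miss_percent, 100.0 - miss_percent


# Counter values copied from the perf stat output quoted in the post.
miss_pct, hit_pct = cache_ratios(37_810_375_576, 20_135_725_239)
print(f"miss: {miss_pct:.3f} %   hit: {hit_pct:.3f} %")
```

With these counters the miss percentage matches the figure perf reports, leaving a hit ratio a little under 47%.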
©2024 cpdn.org