Thread 'New work Discussion'

Author	Message
Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638	Message 64944 - Posted: 9 Jan 2022, 23:33:26 UTC - in response to Message 64898. Two got - two failed. ID: 64944 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64945 - Posted: 10 Jan 2022, 6:13:22 UTC - in response to Message 64941. Thank you, that is helpful to my understanding as one who has only dabbled in programming, and that a long time ago (think Algol!). The fact that the code goes through different compilers depending on whether it is for Linux or Mac, and at one time Windows, probably explains why there have been batches in the past which have produced these faults on only one platform out of the three these tasks went out on then. For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program. Different operating systems, too. When I was working on optimizers, the designers of the regular C-compiler suite used one version of the UNIX kernel and we used a slightly different one. Actually it was the same source, but we turned on a special testing option that made the hardware give an interrupt just prior to the segmentation violation. Then we diddled the compilation system to leave the bottom page of RAM unused, and we set that page to no access. Thus any attempt to access that bottom page gave us an interrupt that we could analyze. One of the programs we compiled and tested was the UNIX kernel itself, which we would run and test. We optimized the kernel as well as everything else, and we found bugs in it. They just did not care and never fixed the bugs we pointed out to them. ID: 64945 ·
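The no-access bottom page described above is still how mainstream systems catch stray null-pointer accesses: the lowest page of the address space is left unmapped, so touching it raises a fault instead of returning garbage. Here is a minimal sketch of that behaviour, assuming a Linux or macOS host with Python 3. The helper name `null_read_exit_code` is made up for illustration and has nothing to do with the compiler work described in the post.

```python
import signal
import subprocess
import sys


def null_read_exit_code() -> int:
    """Run a child process that reads address 0 and return its exit status."""
    # ctypes.string_at(0) asks C code to read a string starting at address 0.
    # The lowest page is normally not mapped, so the access faults and the
    # child is terminated by a signal instead of reading garbage.
    child_code = "import ctypes; ctypes.string_at(0)"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    code = null_read_exit_code()
    if code < 0:
        # On POSIX, a negative return code means "terminated by signal -code".
        print("child killed by", signal.Signals(-code).name)
    else:
        print("child exited normally with status", code)
```

The read is done in a child process so that the fault does not take down the interpreter running the check.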

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242	Message 64946 - Posted: 10 Jan 2022, 10:18:11 UTC - in response to Message 64945. ...we found bugs in it. They just did not care and never fixed the bugs we pointed out to them. So depressing and yet so common. ID: 64946 ·

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 64947 - Posted: 10 Jan 2022, 11:23:03 UTC - in response to Message 64943. The problem with batch 926, is that there are some bad data sets in among some good data sets. They can either be all killed off, or they can all be left to run, which will quickly remove the bad ones. The 2nd option is being used. That’s fair enough, if they fail they fail very quickly. ID: 64947 ·

Nigel Garvey Send message Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258	Message 64951 - Posted: 10 Jan 2022, 19:52:50 UTC - in response to Message 64926. Last modified: 10 Jan 2022, 19:53:21 UTC I wrote: my iMac … picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly. Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units. WU 12129231 WU 12129961 WU 12127970 NG ID: 64951 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431	Message 64952 - Posted: 10 Jan 2022, 20:35:10 UTC - in response to Message 64951. Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units. That has been my impression too. However, there does seem to be a large element of randomness about it. ID: 64952 ·

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,852,553 RAC: 19,917	Message 64953 - Posted: 10 Jan 2022, 21:18:46 UTC Could someone please look at this failed task https://www.cpdn.org/result.php?resultid=22185093? Failed for different reason than most recently here. Would like to know what the log means and if the Run & CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out. ID: 64953 ·

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,815,352 RAC: 5,242	Message 64954 - Posted: 10 Jan 2022, 21:25:28 UTC - in response to Message 64953. Could someone please look at this failed task https://www.cpdn.org/result.php?resultid=22185093? Failed for different reason than most recently here. Would like to know what the log means and if the Run & CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out. Errors involving “invalid theta” indicate that the model’s physics has become unrealistic, so the model is stopped. For a computer that is not over-clocked such errors can be ignored. The model is likely to fail in the same way on other PCs with the same architecture, but might succeed, for example, on a Mac with a different floating-point library. It’s all part of the ensemble method of modelling: the project knows that some models are on the edge. ID: 64954 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64955 - Posted: 11 Jan 2022, 4:01:30 UTC - in response to Message 64945. For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program. Sorry: I spelled his name wrong. https://www.cs.cornell.edu/gries/ ID: 64955 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64958 - Posted: 12 Jan 2022, 16:08:23 UTC - in response to Message 64947. That’s fair enough, if they fail they fail very quickly. They sure do. I just got two more that failed after 36 seconds wall clock time, and just under 4 seconds cpu time. Segmentation violations. https://www.cpdn.org/result.php?resultid=22181507 https://www.cpdn.org/result.php?resultid=22182265 ID: 64958 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431	Message 64959 - Posted: 12 Jan 2022, 16:46:49 UTC My last few to download have all failed. Two more downloading right now. ID: 64959 ·

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 64960 - Posted: 12 Jan 2022, 16:55:18 UTC - in response to Message 64958. That’s fair enough, if they fail they fail very quickly. They sure do. I just got two more that failed after 36 seconds wall clock time, and just under 4 seconds cpu time. Segmentation violations. https://www.cpdn.org/result.php?resultid=22181507 https://www.cpdn.org/result.php?resultid=22182265 I’ve realised a problem with my logic - when they fail the hour delay cuts in before it can pull down the next task to try. In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine. ID: 64960 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64961 - Posted: 12 Jan 2022, 19:47:34 UTC - in response to Message 64960. In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine. I was away for a couple of hours, and three more failed. All with segmentation error. One of mine that failed also failed on another machine running Darwin: it does not seem to like machines running Darwin. But my Linux machine is 1511241 ID: 64961 ·

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638	Message 64963 - Posted: 12 Jan 2022, 23:14:48 UTC - in response to Message 64959. The 5 I got over the weekend and the 2 today have all failed. My machine is Ubuntu 20.04. ID: 64963 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 64964 - Posted: 12 Jan 2022, 23:48:30 UTC On the other side of things, I started a new one today, but it's hadam4h, batch 895. Phew. :) Hmmm. That batch is almost a year old; the last attempt on mine was abandoned after a year, without any trickles. ID: 64964 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64965 - Posted: 13 Jan 2022, 3:17:48 UTC - in response to Message 64964. On the other side of things, I started a new one today, but it's hadam4h, batch 895. Phew. :) Hmmm. That batch is almost a year old; the last attempt on mine was abandoned after a year, without any trickles. I get old ones sometimes too. Some failed several times before I got them, and for different reasons. hadam4h_e0zp_207111_5_887_012043123_3 hadam4h_h1g7_201011_5_889_012045366_3 hadam4h_b0v6_201211_5_882_012036130_1 ID: 64965 ·

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638	Message 64966 - Posted: 13 Jan 2022, 23:23:03 UTC - in response to Message 64964. Last modified: 13 Jan 2022, 23:28:22 UTC Same here. Suspended pending 926 tasks and started the 895 one just to check if it would run OK. Looking good at the moment. There seem to be a lot of computers that are "idle" for several months. Extra grist to the mill for a shortening of the allowed/estimated completion time before reissuing the task. ID: 64966 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64967 - Posted: 14 Jan 2022, 0:42:55 UTC - in response to Message 64966. Suspended pending 926 tasks and started the 895 one just to check if it would run OK. I take them as they come. My N216 tasks mostly run OK, but all 13 or so 926 tasks have failed with segmentation faults after 2 to 4 seconds of processor time. These have all been in January 2022. The same model has worked OK in the past. I have one more downloaded, but there are 4 CPDN tasks in the queue before that one. I can run up to 4 CPDN tasks at a time on this machine. Actually, I have an 8-core machine that can run 16 tasks at a time, but it does not make sense for me to run more than 8 Boinc tasks at a time. ID: 64967 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431	Message 64968 - Posted: 14 Jan 2022, 17:05:35 UTC A total of 107 of #926 are now showing as completed. I will have to wait till some N216 tasks complete before I can try any more. ID: 64968 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 64970 - Posted: 15 Jan 2022, 7:17:04 UTC Right now my machine is idle except for 8 Boinc tasks. Of these, three are N216 CPDN tasks and five are WCG models. Of those, two are OPN1, two are ARP1, and one is MCM1. The essentials of my machine are:
CPU type	GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors	16
Operating System	Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version	7.16.11
Memory	62.4 GB
Cache	16896 KB <---<<<
top - 02:13:32 up 16 days, 12:00, 1 user, load average: 8.08, 8.25, 8.31
Tasks: 463 total, 9 running, 453 sleeping, 1 stopped, 0 zombie
%Cpu(s): 0.3 us, 2.8 sy, 47.0 ni, 49.7 id, 0.0 wa, 0.1 hi, 0.1 si, 0.0 st
MiB Mem : 63902.2 total, 867.4 free, 9746.1 used, 53288.6 buff/cache
MiB Swap: 15992.0 total, 15043.0 free, 949.0 used. 53304.0 avail Mem
Note: with all the RAM I have, it does essentially no paging. Now let us look at the cache hit ratio. Almost half of the memory requests are found in the cache with this processor.
# perf stat -aB -e cache-references,cache-misses
Performance counter stats for 'system wide':
37,810,375,576 cache-references
20,135,725,239 cache-misses # 53.254 % of all cache refs
60.935538193 seconds time elapsed
ID: 64970 ·
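For anyone who wants to check the "almost half" figure, the hit ratio follows directly from the two perf counters: misses divided by references gives the miss percentage, and the remainder is the hit percentage. A minimal sketch using the numbers quoted in the post. The function name is made up for illustration, and it is an assumption here that these generic perf events count last-level cache accesses rather than every memory request, which is what they usually map to on Intel hardware.

```python
from __future__ import annotations


def cache_hit_and_miss_percent(references: int, misses: int) -> tuple[float, float]:
    """Return (hit %, miss %) from perf's cache-references and cache-misses."""
    if references <= 0:
        raise ValueError("references must be positive")
    miss_percent = 100.0 * misses / references
    return 100.0 - miss_percent, miss_percent


if __name__ == "__main__":
    # Counter values copied from the perf stat output in the post.
    hit, miss = cache_hit_and_miss_percent(37_810_375_576, 20_135_725_239)
    print(f"miss: {miss:.3f} %   hit: {hit:.3f} %")
```

With the posted counters this gives a miss rate of about 53.25 %, matching the perf annotation, and so a hit rate of roughly 46.7 %.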