Thread 'New work Discussion'

Author	Message
KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63320 - Posted: 16 Jan 2021, 12:16:13 UTC At least the Linux WU's have taken a big hit after all this. I think a lot of Windows users have installed VM's. Yes, with 16 GB of memory, I am at best able to run three WU's. I think the VM itself is using one GB and the system is using the rest. ID: 63320 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63321 - Posted: 16 Jan 2021, 13:10:57 UTC - in response to Message 63306. so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units. If the WCG tasks are Africa Rain Project ones they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about. I have limited my machine to two African Rain Projects maximum at a time. In practice, sometimes two of them run at once, but mostly only one. Three N216 tasks run all the time. Right now, it is like this:
    PID    PPID USER  PR NI S    RES   SHR %MEM %CPU  P     TIME+ COMMAND
 671142  671122 boinc 39 19 R   1.4g 19760  2.2 99.8  1  11033:00 /var/lib/boinc/projects/climateprediction.net/hadam4+
1165471 1165464 boinc 39 19 R   1.3g 19808  2.1 99.8  7   1369:13 /var/lib/boinc/projects/climateprediction.net/hadam4+
1229683 1229662 boinc 39 19 R   1.3g 19852  2.1 99.8  5  24:41.03 /var/lib/boinc/projects/climateprediction.net/hadam4+
1214305    2079 boinc 39 19 R 761924 28688  1.2 99.8  6 366:17.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+
1219941    2079 boinc 39 19 R 761644 28688  1.2 99.8 13 232:58.03 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+
1221370    2079 boinc 39 19 R 670732 83200  1.0 99.8  4 142:13.06 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.+
1226978    2079 boinc 39 19 R 131792  2092  0.2 99.8  0  73:27.79 ../../projects/www.worldcommunitygrid.org/wcgrid_opn+
1227620    2079 boinc 39 19 R  72996  2464  0.1 99.8  3  57:53.45 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm+
ID: 63321 ·

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 63322 - Posted: 16 Jan 2021, 14:33:46 UTC - in response to Message 63306. If the WCG tasks are Africa Rain Project ones they like the N216 tasks use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about. Yes. In fact, I have not found another good WCG project to run with N216. Even MCM, though it does not take up much cache apparently, runs slowly itself on my i7-8700 alongside N216. The best I have found is TN-Grid, though it is not on the BOINC projects list and you will have to add it manually. But it is very light on the resources that N216 needs, and is not itself slowed down either on my machine. http://gene.disi.unitn.it/test/index.php It is not a COVID project per se, but they do some genes related to it. ID: 63322 ·

lazlo_vii Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0	Message 63323 - Posted: 17 Jan 2021, 23:47:41 UTC - in response to Message 63322. Last modified: 18 Jan 2021, 0:02:32 UTC Unless you isolate the workloads to physically coherent caches, which is not possible on all CPU architectures, you may not find a balance for mixing CPDN tasks with WCG ARP and MIP tasks while using all CPU cores. My testing on Ryzen 3000 CPU's has found that at best I should run one CPDN, ARP, and MIP task per 8MB of isolated L3 cache. I don't even do that. I run each project and sub-project on dedicated three or four core segments of my CPU's because I don't want to deal with fine tuning in greater detail. Don't take this as Gospel. It's just my own very limited testing. If you are intent on mixing BOINC projects I urge you to investigate on a per CPU model + OS basis. Do not think that an Intel Haswell generation CPU + Ubuntu can operate with the same mix as an AMD Bulldozer or Ryzen + CentOS. Prove what it can take to yourself and then, please, share it with all of us. I say this because different distros will use different kernel versions, among other things. My biggest issue right now is that I have had numerous CPDN WU's fail lately. I think the cause is due to (and I have no way to prove this because I cannot see the code) CPDN setting a lower priority on disk I/O than other projects and other background tasks I have running. Even after a five minute waiting period before restarting a computer and its containers, tasks from CPDN have failed because they couldn't (wouldn't?) write their data to disk (because it was too busy?) before I rebooted. This is my main frustration with CPDN at the moment. Two days ago I decided to give up on CPDN for the short term (again) because it is just a waste of time and electricity to crunch numbers on tasks that can't take a system reboot for security updates. ID: 63323 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63324 - Posted: 18 Jan 2021, 9:44:55 UTC Interesting, at the moment I am running five tasks out of 8 real cores on my Ryzen 7. Even if I run 8 I seem to rarely get crashes when I reboot. I find running 5 of the N216 tasks is the sweet spot, with little gain if any in throughput going up to 8, and when using virtual cores throughput actually drops. I have never tried using containers but wonder if that could be a factor. I am using Ubuntu 20.10 and BOINC 7.17.0 compiled locally from source from GitHub. ID: 63324 ·

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 63325 - Posted: 18 Jan 2021, 13:37:50 UTC - in response to Message 63324. Even if I run 8 I seem to rarely get crashes when I reboot. I only get a crash when I reboot after an Ubuntu update (mainly of the Linux kernel). To prevent that, suspend the work unit before the reboot, and it should work. ID: 63325 ·
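A minimal sketch of the suspend-before-reboot routine suggested above, written as a dry run that only prints the commands. It assumes `boinccmd` (the BOINC command-line tool) is installed and is run from the BOINC data directory so that it can authenticate to the client. The function name `pre_reboot_steps` is hypothetical, and the 300-second wait is an assumption taken from the five-minute wait mentioned elsewhere in this thread, not a tested value.

```shell
#!/bin/sh
# Dry run: print the steps for quiescing BOINC before a reboot.
# Nothing is executed; copy the printed lines into a terminal to run them.

# Assumption: seconds to wait for tasks to stop and write their files.
WAIT_SECONDS=300

pre_reboot_steps() {
    echo "boinccmd --set_run_mode never"   # suspend all computation
    echo "sleep $1"                        # give the tasks time to settle
    echo "sync"                            # flush cached writes to disk
    echo "boinccmd --quit"                 # ask the BOINC client to exit
}

pre_reboot_steps "$WAIT_SECONDS"
```

Whether this is enough to avoid failed CPDN tasks will depend on the machine and storage; it only makes the manual steps repeatable.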

lazlo_vii Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0	Message 63327 - Posted: 18 Jan 2021, 16:13:39 UTC - in response to Message 63325. Last modified: 18 Jan 2021, 16:15:06 UTC Even if I run 8 I seem to rarely get crashes when I reboot. I only get a crash when I reboot after an Ubuntu update (mainly of the Linux kernel). To prevent that, suspend the work unit before the reboot, and it should work. I have tried suspending CPDN tasks before rebooting; however, I still had failures. I think it's because the disks I use to store all of the BOINC data on (for all four of my computers, and each computer runs at least two containers) are writing data almost constantly. All of my BOINC projects (60 threads of crunching) average about 180MB of data written per minute across a four disk RAID10 array. Since three of the systems use the storage across the network, that means I have to shut them down before I reboot the main server. If I suspend all of my BOINC projects on the main server and wait about five minutes before I reboot it's not quite so bad, but I still lose work sometimes. Also, sometimes I forget to stop everything before the reboot and that is almost certain to cause failures for CPDN. So long story short, rebooting my network is a 10 or 15 minute job instead of a 2 minute job when I run CPDN tasks. I don't want to have to go through unusual procedures just because I run CPDN. I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off of the spinning disks and onto the SSD's but I really don't add write heavy workloads to them. ID: 63327 ·
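On raising the I/O priority of CPDN tasks: one possible approach, sketched here as a dry run, is the `ionice` tool from util-linux. This is an assumption about what might help, not a confirmed fix. The setting applies only to processes that are already running (so it is not permanent), changing another user's processes needs root, and I/O priorities are honoured only by some I/O schedulers (BFQ, for example). The process pattern is taken from the `top` listings in this thread, and the function name `ionice_cmd` is hypothetical.

```shell
#!/bin/sh
# Dry run: print (do not execute) ionice commands for running CPDN tasks.

# Assumption: pattern taken from the top output posted in this thread.
PATTERN='climateprediction.net/hadam4'

ionice_cmd() {
    # -c 2 selects the best-effort class; -n 0 is the highest priority in it.
    echo "ionice -c 2 -n 0 -p $1"
}

# pgrep -f matches against the full command line; no matches prints nothing.
for pid in $(pgrep -f "$PATTERN" 2>/dev/null); do
    ionice_cmd "$pid"
done
```

The printed commands would be run with `sudo`. New tasks started later would need the same treatment again, so this is better suited to testing whether I/O priority is the cause than to a permanent setup.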

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 63329 - Posted: 18 Jan 2021, 18:06:06 UTC Last modified: 18 Jan 2021, 18:10:47 UTC GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Edited for typo. ID: 63329 ·

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 63330 - Posted: 18 Jan 2021, 18:21:44 UTC - in response to Message 63327. I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off of the spinning disks and onto the SSD's but I really don't add write heavy workloads to them. I am glad you asked. I learned early on in the CPDN game that you needed fast disk access. I was using spinning platters back then, but have since moved to SSDs. In either case, it helps to have a write-cache. In the case of spinning platters, it prevents errors. In the case of SSDs, it protects the SSDs from excessive writes. For Windows, I used either PrimoCache or a ramdisk (Primo Ramdisk or Dataram ramdisk). The cache takes up less memory, and is easier to set up, so that was what I ended up with. But if you have lots of main memory, you can just put the entire BOINC data directory on a ramdisk and you are done. For Ubuntu, I use the built-in caching system, just setting the memory size and latency (the time the writes are held in memory before being written to the disk) to much larger values. This will work nicely, though smaller values are possible if you don't have so much memory:
Swappiness, to reduce the use of swap:
sudo sysctl vm.swappiness=0
Set write cache to 8 GB/8.5 GB, for 32 GB main memory:
sudo sysctl vm.dirty_background_bytes=8000000000
sudo sysctl vm.dirty_bytes=8500000000
sudo sysctl vm.dirty_writeback_centisecs=500 (checks the cache every 5 seconds)
sudo sysctl vm.dirty_expire_centisecs=360000 (page flush 60 min)
The first value (8 GB) sets the size of the cache, and the second value (8.5 GB) sets the maximum amount of writes possible before all operations are halted until the contents are written to the disk. You normally never see that in practice, since the SSDs are usually fast enough to keep up. You flush the cache every 60 minutes, though I often set it to 2 or 4 hours with still larger cache sizes. But you are not writing to the disk as much as the OS writes to the cache, since most values in scientific applications are overwritten many times as the results are updated. With these values on a couple of CPDN work units, I expect you will write less than 30% as much to disk as the work units write from the OS. And they are "easy" writes, since they are serialized. It is the random writes of the raw data that kill SSDs the fastest. Here is the info: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ You can use much smaller values if you don't have so much memory. A 2 GB cache and 30 minute latency will extend the life of the SSD a lot. ID: 63330 ·
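A minimal sketch of the write-cache settings from the post above, generated by a small helper so the minutes-to-centiseconds arithmetic is explicit. Nothing is applied to the system; the script only prints the settings. The function names `minutes_to_centisecs` and `write_cache_settings` are hypothetical, and the byte values are the ones quoted in the post for a machine with 32 GB of main memory.

```shell
#!/bin/sh
# Print sysctl settings for a larger Linux write cache.
# Nothing is changed on the system; the lines are only printed.

# Convert minutes to centiseconds (1 minute = 60 s = 6000 centiseconds).
minutes_to_centisecs() {
    echo $(( $1 * 60 * 100 ))
}

# Usage: write_cache_settings BACKGROUND_BYTES LIMIT_BYTES EXPIRE_MINUTES
write_cache_settings() {
    echo "vm.swappiness=0"
    echo "vm.dirty_background_bytes=$1"
    echo "vm.dirty_bytes=$2"
    echo "vm.dirty_writeback_centisecs=500"
    echo "vm.dirty_expire_centisecs=$(minutes_to_centisecs "$3")"
}

# Values from the post: 8 GB / 8.5 GB with a 60-minute expiry.
write_cache_settings 8000000000 8500000000 60
```

Settings made with `sudo sysctl` are lost at reboot. One way to keep them, assuming a distribution that reads `/etc/sysctl.d/`, is to save the printed lines in a file there (the file name is up to you) and load it with `sudo sysctl --system`. For the smaller 2 GB cache with 30 minute latency mentioned in the post, the third argument would be 30; the matching limit value is not given in the thread and would have to be chosen.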

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63331 - Posted: 18 Jan 2021, 18:39:20 UTC If I may add to this discussion on WU's crashes and system reboots. I have two computers running the same version of Boinc. When I reboot (which I have to) I normally exit Boinc beforehand. One of the computers was crashing WU's. What was happening in my case was that, on exiting Boinc on this computer, the WU's were still running in Task Manager. Now, what will happen if upfront you have exited Boinc but the WU's are still running and the system goes through a reboot? I have again reinstalled Boinc but the problem is still there. The defaulting computer is an Acer while the Dell is cooperating. I have still not been able to solve this one. The version of Boinc is the current one. ID: 63331 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 63332 - Posted: 18 Jan 2021, 19:02:35 UTC - in response to Message 63331. If I may add to this discussion on WU's crashes and system reboots. I have two computers running the same version of Boinc. When I reboot (which I have to) I normally exit Boinc beforehand. One of the computers was crashing WU's. What was happening in my case was that, on exiting Boinc on this computer, the WU's were still running in Task Manager. Now, what will happen if upfront you have exited Boinc but the WU's are still running and the system goes through a reboot? I have again reinstalled Boinc but the problem is still there. The defaulting computer is an Acer while the Dell is cooperating. I have still not been able to solve this one. The version of Boinc is the current one. When you close Boinc, do you get asked "do you wish to stop tasks running"? If not, you may have ticked a "don't ask again" box in there. If you don't get the dialog, in Boinc Manager go to options menu, other options, general tab, enable manager exit dialog. Also, make sure you have: Options menu, computing preferences, disk and memory tab, "leave non-GPU tasks in memory when suspended". This stops the climate tasks screwing up if Boinc pauses them to run another project. ID: 63332 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 63333 - Posted: 18 Jan 2021, 19:03:52 UTC - in response to Message 63329. GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Edited for typo. 12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house, these things make a lot of heat! ID: 63333 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63335 - Posted: 18 Jan 2021, 19:42:08 UTC Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2 degree change rather than 1.5. Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. ID: 63335 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 63336 - Posted: 18 Jan 2021, 19:48:45 UTC - in response to Message 63335. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2 degree change rather than 1.5. Any news on a reissue of the SAFR batch? ID: 63336 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63337 - Posted: 18 Jan 2021, 20:05:06 UTC Any news on a reissue of the SAFR batch? No mention of it on the moderator forums. I suspect the first mention of it will be just before it goes out. ID: 63337 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63338 - Posted: 18 Jan 2021, 20:07:34 UTC - in response to Message 63330. My machine does have a 512GByte SSD, but my BOINC stuff is on a 7200rpm spinning hard drive.
Seagate BarraCuda 2TB Internal Hard Drive HDD - 3.5 Inch SATA 6Gb/s 7200 RPM 256MB Cache [in the drive]
I am running Red Hat Enterprise Linux release 8.2 (Ootpa) with kernel 4.18.0-193.28.1.el8_2.x86_64
Now all the BOINC stuff is on drive /dev/sdb3, and nothing else is on that partition. The other partitions are not very busy now. This shows disk traffic at 5-minute intervals.
$ iostat -t -y --human -d /dev/sdb3 300
Linux 4.18.0-193.28.1.el8_2.x86_64 (localhost.localdomain) 01/18/2021 _x86_64_ (16 CPU)

01/18/2021 02:09:27 PM
Device   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    3.36       0.0k       1.4M     4.0k   413.4M

01/18/2021 02:14:27 PM
Device   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    2.68       0.0k     246.2k     0.0k    72.1M

01/18/2021 02:19:27 PM
Device   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    4.71       0.0k       1.4M     0.0k   410.7M

01/18/2021 02:24:27 PM
Device   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    6.94       0.0k       2.7M     0.0k   804.4M

01/18/2021 02:29:27 PM
Device   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    4.02       0.0k       1.5M     0.0k   444.0M

This shows my BOINC workload; it seems to complete an N216 CPDN work unit in about a week.
top - 14:18:45 up 17 days, 20:36, 1 user, load average: 8.42, 8.31, 8.30
Tasks: 474 total, 9 running, 465 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.2 sy, 49.9 ni, 47.6 id, 1.7 wa, 0.1 hi, 0.0 si, 0.0 st
MiB Mem : 63943.9 total, 2656.3 free, 8170.4 used, 53117.2 buff/cache
MiB Swap: 15992.0 total, 15821.7 free, 170.2 used. 54877.1 avail Mem

USER  PR NI S    RES   SHR %MEM %CPU P     TIME+ COMMAND
boinc 39 19 R   1.3g 19972  2.1 99.7 0   2666:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R   1.3g 19808  2.1 99.7 3   4602:04 /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R   1.3g 19852  2.1 99.7 6   3257:36 /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R 761416 28688  1.2 99.7 7 214:52.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_w+
boinc 39 19 R 421532 76884  0.6 99.7 2 171:19.91 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_+
boinc 39 19 R 113348  2640  0.2 99.9 5  36:26.20 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+
boinc 39 19 R 106192  2632  0.2 99.7 1   9:26.41 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+
boinc 39 19 R  72996  2464  0.1 99.9 4  74:22.78 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_m+
ID: 63338 ·
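The five iostat samples above can be averaged to get a single write rate for the BOINC partition. This is a small sketch of that arithmetic, assuming the `kB_wrtn` totals are copied by hand from the listing (in the "M" units that `iostat --human` printed) and that each sample covers 300 seconds; the function name `average_write_rate` is hypothetical.

```shell
#!/bin/sh
# Average the per-interval write totals from the iostat listing above.

# Each iostat sample in the post covers 300 seconds.
INTERVAL_SECONDS=300

average_write_rate() {
    # Reads one total per line on stdin; prints the average per second and per minute.
    awk -v secs="$INTERVAL_SECONDS" '
        { total += $1; n++ }
        END {
            if (n == 0) exit 1
            rate = total / (n * secs)
            printf "%.2f M/s, %.1f M/min\n", rate, rate * 60
        }'
}

# kB_wrtn totals from the five samples in the post.
printf '%s\n' 413.4 72.1 410.7 804.4 444.0 | average_write_rate
```

For these five samples the result is about 1.43 M/s, or roughly 86 M per minute, on a machine running three N216 tasks alongside other projects. That gives a rough point of comparison with the 180MB per minute quoted earlier in the thread for a larger multi-machine setup.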

lazlo_vii Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0	Message 63339 - Posted: 18 Jan 2021, 22:10:01 UTC - in response to Message 63338. Here is the iostat output for my RAID10 array:
$ iostat -tymd /dev/sd[e-h] 300
(bsquad-host-1) 01/18/21 _x86_64_ (32 CPU)

tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd
0.00 2.38 0.00 0 715 0
0.00 2.38 0.00 0 715 0
0.00 2.38 0.00 0 715 0
0.00 2.38 0.00 0 715 0

tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd
0.00 2.17 0.00 0 649 0
0.00 2.16 0.00 0 647 0
0.00 2.16 0.00 0 647 0
0.00 2.17 0.00 0 649 0

tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd
0.00 2.61 0.00 0 784 0
0.00 2.61 0.00 0 783 0
0.00 2.61 0.00 0 784 0
0.00 2.61 0.00 0 784 0

tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd
0.00 2.29 0.00 0 687 0
0.00 2.29 0.00 0 686 0
0.00 2.29 0.00 0 686 0
0.00 2.29 0.00 0 687 0

Right now the only active workload on these disks is BOINC WCG. ID: 63339 ·

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 63340 - Posted: 18 Jan 2021, 23:11:22 UTC - in response to Message 63333. Last modified: 18 Jan 2021, 23:12:00 UTC 12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house, these things make a lot of heat! My machines are 3 laptops with only 4 cores (2 physical and 2 hyperthreaded) each. So 12 is a lot for me. Also I managed to snag 10 more before the supply ran out. ID: 63340 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63341 - Posted: 18 Jan 2021, 23:45:47 UTC - in response to Message 63331. KAMasud The last few versions of BOINC have had some features removed, to meet the requirements of some organisations that don't want people to be able to fiddle. This may be why your tasks are still running when you exit. You'll need to check what options you have in the menu, possibly under File. As for cpdn models, they each have a lot of files open, which all need to be saved before shutdown. If shutdown occurs in the middle of a model doing a save, then some of what is saved is "old", and some is "new", and the program can't restart that model. ID: 63341 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63342 - Posted: 18 Jan 2021, 23:50:12 UTC - in response to Message 63340. Jim You'll need to run faster if you want to keep up. :) The 2nd batch is there now. ID: 63342 ·