Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
At least the Linux WUs have taken a big hit after all this. I think a lot of Windows users have installed VMs. Yes, with 16 GB of memory, I can run at most three WUs. I think the VM itself uses about one GB and the system itself uses the rest. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
So I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units. I have limited my machine to a maximum of two Africa Rain Project tasks at a time. In practice, sometimes two of them run at once, but mostly only one. Three N216 tasks run all the time. Right now, it is like this:
PID     PPID    USER  PR NI S RES    SHR   %MEM %CPU P  TIME+     COMMAND
671142  671122  boinc 39 19 R 1.4g   19760 2.2  99.8 1  11033:00  /var/lib/boinc/projects/climateprediction.net/hadam4+
1165471 1165464 boinc 39 19 R 1.3g   19808 2.1  99.8 7  1369:13   /var/lib/boinc/projects/climateprediction.net/hadam4+
1229683 1229662 boinc 39 19 R 1.3g   19852 2.1  99.8 5  24:41.03  /var/lib/boinc/projects/climateprediction.net/hadam4+
1214305 2079    boinc 39 19 R 761924 28688 1.2  99.8 6  366:17.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+
1219941 2079    boinc 39 19 R 761644 28688 1.2  99.8 13 232:58.03 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+
1221370 2079    boinc 39 19 R 670732 83200 1.0  99.8 4  142:13.06 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.+
1226978 2079    boinc 39 19 R 131792 2092  0.2  99.8 0  73:27.79  ../../projects/www.worldcommunitygrid.org/wcgrid_opn+
1227620 2079    boinc 39 19 R 72996  2464  0.1  99.8 3  57:53.45  ../../projects/www.worldcommunitygrid.org/wcgrid_mcm+ |
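For anyone sizing memory against a listing like the one above, here is a minimal Python sketch that adds up the RES column. It assumes top's default display, where a bare number is KiB and a trailing letter such as g scales the value; the function name and the grouping into two lists are illustrative only.

```python
def res_to_mib(res):
    """Convert a top RES field to MiB (bare number = KiB, suffix k/m/g/t scales)."""
    units = {"k": 1 / 1024, "m": 1.0, "g": 1024.0, "t": 1024.0 * 1024.0}
    suffix = res[-1].lower()
    if suffix in units:
        return float(res[:-1]) * units[suffix]
    return float(res) / 1024  # no suffix: value is in KiB


# RES values copied from the listing above
hadam4 = ["1.4g", "1.3g", "1.3g"]
others = ["761924", "761644", "670732", "131792", "72996"]

hadam4_mib = sum(res_to_mib(r) for r in hadam4)
others_mib = sum(res_to_mib(r) for r in others)
print(f"N216 tasks: {hadam4_mib:.0f} MiB, other tasks: {others_mib:.0f} MiB")
```

Resident memory is only part of the picture: the figures say nothing about CPU cache pressure, which is what the following posts are about.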
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about. Yes. In fact, I have not found another good WCG project to run alongside N216. Even MCM, though it apparently does not take up much cache, itself runs slowly on my i7-8700 alongside N216. The best I have found is TN-Grid, though it is not on the BOINC projects list, so you will have to add it manually. It is very light on the resources that N216 needs, and is not itself slowed down on my machine either. http://gene.disi.unitn.it/test/index.php It is not a COVID project per se, but they do work on some genes related to it. |
Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0 |
Unless you isolate the workloads to physically coherent caches, which is not possible on all CPU architectures, you may not find a balance for mixing CPDN tasks with WCG ARP and MIP tasks while using all CPU cores. My testing on Ryzen 3000 CPUs has found that at best I should run one CPDN, ARP, and MIP task per 8 MB of isolated L3 cache. I don't even do that. I run each project and sub-project on dedicated three- or four-core segments of my CPUs because I don't want to deal with fine-tuning in greater detail. Don't take this as gospel. It's just my own very limited testing. If you are intent on mixing BOINC projects, I urge you to investigate on a per CPU model + OS basis. Do not assume that an Intel Haswell generation CPU + Ubuntu can operate with the same mix as an AMD Bulldozer or Ryzen + CentOS. Prove to yourself what it can take and then, please, share it with all of us. I say this because different distros use different kernel versions, among other things. My biggest issue right now is that I have had numerous CPDN WUs fail lately. I think the cause (and I have no way to prove this because I cannot see the code) is CPDN setting a lower priority on disk I/O than other projects and other background tasks I have running. Even after a five-minute waiting period before restarting a computer and its containers, tasks from CPDN have failed because they couldn't (wouldn't?) write their data to disk (because it was too busy?) before I rebooted. This is my main frustration with CPDN at the moment. Two days ago I decided to give up on CPDN for the short term (again) because it is just a waste of time and electricity to crunch numbers on tasks that can't survive a system reboot for security updates. |
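The idea of giving each project its own group of cores can be sketched in Python. This is only an illustration under stated assumptions: it treats consecutive logical core IDs as sharing an L3 slice, which is not guaranteed and should be checked with a topology tool such as lscpu or hwloc's lstopo, and the project labels in `plan` are hypothetical. `os.sched_setaffinity` is Linux-only.

```python
import os


def core_segments(n_cores, segment_size):
    """Split core IDs 0..n_cores-1 into consecutive groups of segment_size."""
    return [
        set(range(start, min(start + segment_size, n_cores)))
        for start in range(0, n_cores, segment_size)
    ]


def pin(pid, cores):
    """Restrict an existing process to the given set of cores (Linux only)."""
    os.sched_setaffinity(pid, cores)


# Example: a 16-thread CPU split into four-core segments
segments = core_segments(16, 4)

# Hypothetical assignment of one project per segment
plan = dict(zip(["cpdn", "wcg_arp", "wcg_mip", "other"], segments))
```

Calling `pin(pid, plan["cpdn"])` for each CPDN task PID would confine it to the first segment. The poster uses containers for this instead; pinning by PID has to be repeated whenever BOINC starts a new task.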
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
Interesting. At the moment I am running five tasks on the 8 real cores of my Ryzen 7. Even if I run 8, I seem to rarely get crashes when I reboot. I find running 5 of the N216 tasks is the sweet spot, with little if any gain in throughput going up to 8, and when using virtual cores throughput actually drops. I have never tried using containers but wonder if that could be a factor. I am using Ubuntu 20.10 and BOINC 7.17.0 compiled locally from source from GitHub. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Even if I run 8, I seem to rarely get crashes when I reboot. I only get a crash when I reboot after an Ubuntu update (mainly of the Linux kernel). To prevent that, suspend the work unit before the reboot, and it should work. |
Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0 |
Even if I run 8, I seem to rarely get crashes when I reboot. I have tried suspending CPDN tasks before rebooting; however, I still had failures. I think it's because the disks I use to store all of the BOINC data (for all four of my computers, each of which runs at least two containers) are writing data almost constantly. All of my BOINC projects (60 threads of crunching) average about 180 MB of data written per minute across a four-disk RAID10 array. Since three of the systems use the storage across the network, I have to shut them down before I reboot the main server. If I suspend all of my BOINC projects on the main server and wait about five minutes before I reboot, it's not quite so bad, but I still lose work sometimes. Also, sometimes I forget to stop everything before the reboot, and that is almost certain to cause failures for CPDN. So, long story short, rebooting my network is a 10 or 15 minute job instead of a 2 minute job when I run CPDN tasks. I don't want to have to go through unusual procedures just because I run CPDN. I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off the spinning disks and onto the SSDs, but I really don't add write-heavy workloads to them. |
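On the question of raising I/O priority, one option on Linux is the util-linux `ionice` tool. The sketch below only builds and runs the command for a list of PIDs; it is an illustration, not a tested fix for the failures described above. Whether the kernel honours the setting depends on the I/O scheduler in use, the change is not permanent (it applies only to processes that already exist), and changing another user's process normally needs root. The PID in the example is the first hadam4 PID from the process listing earlier in the thread, used purely as a placeholder.

```python
import subprocess


def ionice_command(pid, level=0):
    """Build an ionice call: best-effort class (2), level 0 (highest) to 7 (lowest)."""
    if not 0 <= level <= 7:
        raise ValueError("best-effort level must be between 0 and 7")
    return ["ionice", "-c", "2", "-n", str(level), "-p", str(pid)]


def raise_io_priority(pids, level=0):
    """Apply the ionice setting to each PID; raises if ionice reports an error."""
    for pid in pids:
        subprocess.run(ionice_command(pid, level), check=True)


# Placeholder PID, for illustration only
cmd = ionice_command(671142)
print(" ".join(cmd))
```

Because new tasks start with default priority, something like this would need to be rerun periodically, for example from a cron job, to approximate a permanent setting.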
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Edited for typo. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off the spinning disks and onto the SSDs, but I really don't add write-heavy workloads to them. I am glad you asked. I learned early on in the CPDN game that you need fast disk access. I was using spinning platters back then, but have since moved to SSDs. In either case, it helps to have a write cache. In the case of spinning platters, it prevents errors. In the case of SSDs, it protects the SSDs from excessive writes. For Windows, I used either PrimoCache or a ramdisk (Primo Ramdisk or Dataram ramdisk). The cache takes up less memory and is easier to set up, so that was what I ended up with. But if you have lots of main memory, you can just put the entire BOINC data directory on a ramdisk and you are done. For Ubuntu, I use the built-in caching system, just setting the memory size and latency (the time the writes are held in memory before being written to the disk) to much larger values. This will work nicely, though smaller values are possible if you don't have so much memory:
Swappiness, to reduce the use of swap:
sudo sysctl vm.swappiness=0
Set write cache to 8 GB/8.5 GB, for 32 GB main memory:
sudo sysctl vm.dirty_background_bytes=8000000000
sudo sysctl vm.dirty_bytes=8500000000
sudo sysctl vm.dirty_writeback_centisecs=500 (checks the cache every 5 seconds)
sudo sysctl vm.dirty_expire_centisecs=360000 (page flush 60 min)
The first value (8 GB) sets the size of the cache, and the second value (8.5 GB) sets the maximum amount of writes possible before all operations are halted until the contents are written to the disk. You normally never see that in practice, since the SSDs are usually fast enough to keep up. You flush the cache every 60 minutes, though I often set it to 2 or 4 hours with still larger cache sizes.
But you are not writing to the disk as much as the OS writes to the cache, since most values in scientific applications are overwritten many times as the results are updated. With these values on a couple of CPDN work units, I expect you will write less than 30% as much to disk as the work units themselves write. And they are "easy" writes, since they are serialized. It is the random writes of the raw data that kill SSDs the fastest. Here is the info: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ You can use much smaller values if you don't have so much memory. A 2 GB cache and 30 minute latency will extend the life of the SSD a lot. |
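To adapt these numbers to a different machine, the arithmetic can be sketched in Python. The helper below is hypothetical: it takes a cache size, a headroom above it, and a flush interval, and prints the matching sysctl lines. It uses decimal gigabytes because that is how the values above are written, and it keeps the 5-second writeback check unchanged. The 0.5 GB headroom for the smaller configuration is an assumption carried over from the 8 GB/8.5 GB example.

```python
def write_cache_settings(cache_gb, headroom_gb, flush_minutes):
    """Return sysctl values for a write cache of cache_gb with a hard limit headroom_gb above it."""
    background = int(cache_gb * 1_000_000_000)
    return {
        "vm.dirty_background_bytes": background,
        "vm.dirty_bytes": background + int(headroom_gb * 1_000_000_000),
        "vm.dirty_writeback_centisecs": 500,  # check the cache every 5 seconds
        "vm.dirty_expire_centisecs": int(flush_minutes * 60 * 100),
    }


# Reproduces the 8 GB / 8.5 GB / 60 min values quoted above
settings = write_cache_settings(8, 0.5, 60)

# The smaller 2 GB / 30 min variant mentioned at the end of the post
small = write_cache_settings(2, 0.5, 30)

for key, value in settings.items():
    print(f"sudo sysctl {key}={value}")
```

Values set with `sysctl` on the command line do not survive a reboot; to keep them they need to go into `/etc/sysctl.conf` or a file under `/etc/sysctl.d/`. A large write cache also means more unwritten data is lost if the machine loses power.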
Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
If I may add to this discussion on WU crashes and system reboots. I have two computers running the same version of BOINC. When I reboot (which I have to), I normally exit BOINC beforehand. One of the computers was crashing WUs. What was happening in my case was that, after exiting BOINC on this computer, the WUs were still running in Task Manager. Now, what will happen if you have exited BOINC up front but the WUs are still running and the system goes through a reboot? I have reinstalled BOINC but the problem is still there. The offending computer is an Acer, while the Dell is cooperating. I have still not been able to solve this one. The version of BOINC is the current one. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
If I may add to this discussion on WU crashes and system reboots. I have two computers running the same version of BOINC. When I reboot (which I have to), I normally exit BOINC beforehand. One of the computers was crashing WUs. What was happening in my case was that, after exiting BOINC on this computer, the WUs were still running in Task Manager. Now, what will happen if you have exited BOINC up front but the WUs are still running and the system goes through a reboot? I have reinstalled BOINC but the problem is still there. The offending computer is an Acer, while the Dell is cooperating. I have still not been able to solve this one. The version of BOINC is the current one. When you close BOINC, are you asked "do you wish to stop tasks running"? If not, you may have ticked a "don't ask again" box there. If you don't get the dialog, in BOINC Manager go to the Options menu, Other options, General tab, and enable the Manager exit dialog. Also, make sure you have this set: Options menu, Computing preferences, Disk and memory tab, "leave non-GPU tasks in memory when suspended". This stops the climate tasks screwing up if BOINC pauses them to run another project. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. 12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house; these things make a lot of heat! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2-degree change rather than 1.5. Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed. Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2-degree change rather than 1.5. Any news on a reissue of the SAFR batch? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
Any news on a reissue of the SAFR batch? No mention of it on the moderator forums. I suspect the first mention of it will be just before it goes out. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My machine does have a 512GByte SSD, but my BOINC stuff is on a 7200rpm spinning hard drive: Seagate BarraCuda 2TB Internal Hard Drive HDD - 3.5 Inch SATA 6Gb/s 7200 RPM 256MB Cache [in the drive]. I am running Red Hat Enterprise Linux release 8.2 (Ootpa) with kernel 4.18.0-193.28.1.el8_2.x86_64. Now all the BOINC stuff is on drive /dev/sdb3, and nothing else is on that partition. The other partitions are not very busy now. This shows disk traffic at 5-minute intervals.
$ iostat -t -y --human -d /dev/sdb3 300
Linux 4.18.0-193.28.1.el8_2.x86_64 (localhost.localdomain) 01/18/2021 _x86_64_ (16 CPU)
01/18/2021 02:09:27 PM
Device  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    3.36  0.0k       1.4M       4.0k     413.4M
01/18/2021 02:14:27 PM
Device  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    2.68  0.0k       246.2k     0.0k     72.1M
01/18/2021 02:19:27 PM
Device  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    4.71  0.0k       1.4M       0.0k     410.7M
01/18/2021 02:24:27 PM
Device  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    6.94  0.0k       2.7M       0.0k     804.4M
01/18/2021 02:29:27 PM
Device  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdb3    4.02  0.0k       1.5M       0.0k     444.0M
This shows my BOINC workload; it seems to complete an N216 CPDN work unit in about a week.
top - 14:18:45 up 17 days, 20:36, 1 user, load average: 8.42, 8.31, 8.30
Tasks: 474 total, 9 running, 465 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.2 sy, 49.9 ni, 47.6 id, 1.7 wa, 0.1 hi, 0.0 si, 0.0 st
MiB Mem : 63943.9 total, 2656.3 free, 8170.4 used, 53117.2 buff/cache
MiB Swap: 15992.0 total, 15821.7 free, 170.2 used. 54877.1 avail Mem
USER  PR NI S RES    SHR   %MEM %CPU P TIME+     COMMAND
boinc 39 19 R 1.3g   19972 2.1  99.7 0 2666:19   /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R 1.3g   19808 2.1  99.7 3 4602:04   /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R 1.3g   19852 2.1  99.7 6 3257:36   /var/lib/boinc/projects/climateprediction.net/hadam4_um+
boinc 39 19 R 761416 28688 1.2  99.7 7 214:52.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_w+
boinc 39 19 R 421532 76884 0.6  99.7 2 171:19.91 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_+
boinc 39 19 R 113348 2640  0.2  99.9 5 36:26.20  ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+
boinc 39 19 R 106192 2632  0.2  99.7 1 9:26.41   ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+
boinc 39 19 R 72996  2464  0.1  99.9 4 74:22.78  ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_m+ |
Send message Joined: 11 Dec 19 Posts: 108 Credit: 3,012,142 RAC: 0 |
Here is the iostat output for my RAID10 array:
$ iostat -tymd /dev/sd[e-h] 300
Linux 5.8.0-38-generic (bsquad-host-1) 01/18/21 _x86_64_ (32 CPU)
01/18/21 21:52:45
Device  tps    MB_read/s  MB_wrtn/s  MB_dscd/s  MB_read  MB_wrtn  MB_dscd
sde     28.11  0.00       2.38       0.00       0        715      0
sdf     28.88  0.00       2.38       0.00       0        715      0
sdg     28.36  0.00       2.38       0.00       0        715      0
sdh     28.37  0.00       2.38       0.00       0        715      0
01/18/21 21:57:45
Device  tps    MB_read/s  MB_wrtn/s  MB_dscd/s  MB_read  MB_wrtn  MB_dscd
sde     26.81  0.00       2.17       0.00       0        649      0
sdf     27.05  0.00       2.16       0.00       0        647      0
sdg     26.16  0.00       2.16       0.00       0        647      0
sdh     27.00  0.00       2.17       0.00       0        649      0
01/18/21 22:02:45
Device  tps    MB_read/s  MB_wrtn/s  MB_dscd/s  MB_read  MB_wrtn  MB_dscd
sde     32.42  0.00       2.61       0.00       0        784      0
sdf     32.33  0.00       2.61       0.00       0        783      0
sdg     32.30  0.00       2.61       0.00       0        784      0
sdh     32.18  0.00       2.61       0.00       0        784      0
01/18/21 22:07:45
Device  tps    MB_read/s  MB_wrtn/s  MB_dscd/s  MB_read  MB_wrtn  MB_dscd
sde     26.28  0.00       2.29       0.00       0        687      0
sdf     27.41  0.00       2.29       0.00       0        686      0
sdg     26.60  0.00       2.29       0.00       0        686      0
sdh     27.51  0.00       2.29       0.00       0        687      0
Right now the only active workload on these disks is BOINC WCG. |
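The per-disk figures can be turned into a rate for the array as a whole. The Python sketch below assumes a conventional RAID10 layout in which every block is stored on two disks, so the data the applications wrote is half of what the four disks wrote in total; if the array is laid out differently, `MIRROR_COPIES` would need changing.

```python
# MB written per disk (sde, sdf, sdg, sdh) in each 300-second sample,
# copied from the MB_wrtn column of the iostat output above
samples = [
    [715, 715, 715, 715],
    [649, 647, 647, 649],
    [784, 783, 784, 784],
    [687, 686, 686, 687],
]

INTERVAL_S = 300
MIRROR_COPIES = 2  # assumption: two-way mirroring, each block lands on two disks


def logical_mb_per_minute(per_disk_mb, interval_s=INTERVAL_S, copies=MIRROR_COPIES):
    """Estimate MB per minute written by applications from per-disk totals."""
    physical = sum(per_disk_mb)
    return physical / copies / (interval_s / 60)


rates = [logical_mb_per_minute(sample) for sample in samples]
average = sum(rates) / len(rates)
print([round(r, 1) for r in rates], round(average, 1))
```

This covers only the four samples shown, so it is a snapshot rather than a long-run average.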
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house, these things make a lot of heat! My machines are 3 laptops with only 4 cores (2 physical and 2 hyperthreaded) each. So 12 is a lot for me. Also, I managed to snag 10 more before the supply ran out. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
KAMasud, the last few versions of BOINC have had some features removed, to meet the requirements of some organisations that don't want people to be able to fiddle. This may be why your tasks are still running when you exit. You'll need to check what options you have in the menu, possibly under File. As for CPDN models, they each have a lot of files open, which all need to be saved before shutdown. If shutdown occurs in the middle of a model doing a save, then some of what is saved is "old" and some is "new", and the program can't restart that model. |
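Since an interrupted save is what breaks a model, a cautious pre-reboot routine is to suspend computing, give the tasks time to finish writing, and only then stop the client. The Python sketch below strings that together with `boinccmd`. It is an outline under assumptions, not a guaranteed cure: the five-minute wait is taken from the figure mentioned earlier in the thread, `boinccmd` is assumed to be on the PATH and able to authenticate with the client (for example by being run from the BOINC data directory), and the function names are made up for this example.

```python
import subprocess
import time


def pre_reboot_commands():
    """Commands to suspend all computing and then shut down the BOINC client."""
    return [
        ["boinccmd", "--set_run_mode", "never"],
        ["boinccmd", "--quit"],
    ]


def prepare_for_reboot(settle_seconds=300, run=subprocess.run, sleep=time.sleep):
    """Suspend tasks, wait for pending writes to settle, then stop the client."""
    suspend, quit_client = pre_reboot_commands()
    run(suspend, check=True)
    sleep(settle_seconds)
    run(quit_client, check=True)
```

The `never` run mode stays in force until it is changed, so after the reboot computing has to be re-enabled with `boinccmd --set_run_mode auto`. It is also worth confirming that no task processes are still running before the reboot itself.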
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Jim, you'll need to run faster if you want to keep up. :) The 2nd batch is there now. |
©2024 cpdn.org