climateprediction.net (CPDN) home page
Thread 'New work Discussion'

KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63320 - Posted: 16 Jan 2021, 12:16:13 UTC

At least the Linux WUs have taken a big hit after all this. I think a lot of Windows users have installed VMs.
Yes, with 16 GB of memory, I am able to run three WUs at most. I think the VM itself uses about one GB, and the rest goes to the system itself.
ID: 63320 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63321 - Posted: 16 Jan 2021, 13:10:57 UTC - in response to Message 63306.  

so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units.



If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about.

I have limited my machine to a maximum of two Africa Rain Project tasks at a time. In practice, sometimes two of them run at once, but mostly only one. Three N216 tasks run all the time. Right now, it is like this:

    PID    PPID USER      PR  NI S    RES    SHR  %MEM  %CPU  P     TIME+ COMMAND                                               
 671142  671122 boinc     39  19 R   1.4g  19760   2.2  99.8  1  11033:00 /var/lib/boinc/projects/climateprediction.net/hadam4+ 
1165471 1165464 boinc     39  19 R   1.3g  19808   2.1  99.8  7   1369:13 /var/lib/boinc/projects/climateprediction.net/hadam4+ 
1229683 1229662 boinc     39  19 R   1.3g  19852   2.1  99.8  5  24:41.03 /var/lib/boinc/projects/climateprediction.net/hadam4+ 
1214305    2079 boinc     39  19 R 761924  28688   1.2  99.8  6 366:17.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+ 
1219941    2079 boinc     39  19 R 761644  28688   1.2  99.8 13 232:58.03 ../../projects/www.worldcommunitygrid.org/wcgrid_arp+ 
1221370    2079 boinc     39  19 R 670732  83200   1.0  99.8  4 142:13.06 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.+ 
1226978    2079 boinc     39  19 R 131792   2092   0.2  99.8  0  73:27.79 ../../projects/www.worldcommunitygrid.org/wcgrid_opn+ 
1227620    2079 boinc     39  19 R  72996   2464   0.1  99.8  3  57:53.45 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm+ 
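
One common way to cap the number of tasks of a single application is a BOINC app_config.xml in that project's directory. The sketch below only prints such a file. It assumes the Africa Rain Project application's short name is "arp1" (inferred from the wcgrid_arp+ executable name above, not confirmed in this thread), and the helper name make_app_config is made up for illustration.

```shell
# Print a BOINC app_config.xml that caps how many tasks of one app run at once.
# $1 = app short name (assumed "arp1" for Africa Rain Project), $2 = max tasks.
make_app_config() {
    app_name=$1
    max_tasks=$2
    cat <<EOF
<app_config>
  <app>
    <name>${app_name}</name>
    <max_concurrent>${max_tasks}</max_concurrent>
  </app>
</app_config>
EOF
}
```

For example, "make_app_config arp1 2" could be redirected into app_config.xml under /var/lib/boinc/projects/www.worldcommunitygrid.org/, after which the client needs to re-read its config files (or be restarted) for the limit to take effect.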

ID: 63321 · Report as offensive
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63322 - Posted: 16 Jan 2021, 14:33:46 UTC - in response to Message 63306.  

If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about.

Yes. In fact, I have not found another good WCG project to run with N216. Even MCM, though it does not take up much cache apparently, runs slowly itself on my i7-8700 alongside N216.

The best I have found is TN-Grid, though it is not on the BOINC projects list and you will have to add it manually.
But it is very light on the resources that N216 needs, and is not itself slowed down either on my machine.
http://gene.disi.unitn.it/test/index.php
It is not a COVID project per se, but they do some genes related to it.
ID: 63322 · Report as offensive
lazlo_vii

Joined: 11 Dec 19
Posts: 108
Credit: 3,012,142
RAC: 0
Message 63323 - Posted: 17 Jan 2021, 23:47:41 UTC - in response to Message 63322.  
Last modified: 18 Jan 2021, 0:02:32 UTC

Unless you isolate the workloads to physically coherent caches, which is not possible on all CPU architectures, you may not find a balance for mixing CPDN tasks with WCG ARP and MIP tasks while using all CPU cores. My testing on Ryzen 3000 CPUs has found that at best I should run one CPDN, ARP, and MIP task per 8MB of isolated L3 cache. I don't even do that. I run each project and sub-project on dedicated three or four core segments of my CPUs because I don't want to deal with fine tuning in greater detail.
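
To see which cores share an L3 cache on a given Linux machine, something like the sketch below may help. It assumes the cache directory named index3 is the L3 cache, which is typical on x86 but can be checked against the "level" file next to it. The function name l3_groups is made up for illustration, and this is not necessarily how the segments described above were set up.

```shell
# List the distinct sets of CPUs that share an L3 cache, one set per line.
# $1 = sysfs CPU directory (defaults to the standard Linux location).
l3_groups() {
    cpu_root=${1:-/sys/devices/system/cpu}
    # Every CPU reports the list of CPUs it shares index3 with; keep unique lists.
    cat "$cpu_root"/cpu[0-9]*/cache/index3/shared_cpu_list 2>/dev/null | sort -u
}
```

Each output line (for example 0-3) is a candidate segment. A running task could then be pinned to one segment with "taskset -cp 0-3 PID", though tasks that BOINC starts later would not inherit that pinning.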

Don't take this as Gospel. It's just my own very limited testing.

If you are intent on mixing BOINC projects I urge you to investigate on a per CPU model + OS basis. Do not think that an Intel Haswell generation CPU + Ubuntu can operate with the same mix as an AMD Bulldozer or Ryzen + CentOS. Prove what it can take to yourself and then, please, share it with all of us. I say this because different distros will use different kernel versions, among other things.

My biggest issue right now is that I have had numerous CPDN WUs fail lately. I think it is due to (and I have no way to prove this because I cannot see the code) CPDN setting a lower priority on disk I/O than other projects and other background tasks I have running. Even after a five minute waiting period before restarting a computer and its containers, tasks from CPDN have failed because they couldn't (wouldn't?) write their data to disk (because it was too busy?) before I rebooted. This is my main frustration with CPDN at the moment. Two days ago I decided to give up on CPDN for the short term (again) because it is just a waste of time and electricity to crunch numbers on tasks that can't take a system reboot for security updates.
ID: 63323 · Report as offensive
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63324 - Posted: 18 Jan 2021, 9:44:55 UTC

Interesting. At the moment I am running five tasks on the 8 real cores of my Ryzen 7. Even if I run 8 I seem to rarely get crashes when I reboot. I find running 5 of the N216 tasks is the sweet spot, with little if any gain in throughput going up to 8, and when using virtual cores throughput actually drops. I have never tried using containers but wonder if that could be a factor. I am using Ubuntu 20.10 and BOINC 7.17.0 compiled locally from source from GitHub.
ID: 63324 · Report as offensive
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63325 - Posted: 18 Jan 2021, 13:37:50 UTC - in response to Message 63324.  

Even if I run 8 I seem to rarely get crashes when I reboot.

I only get a crash when I reboot after an Ubuntu update (mainly of the Linux kernel). To prevent that, suspend the work unit before the reboot, and it should work.
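
A minimal sketch of that pre-reboot routine, assuming the boinccmd tool is installed and allowed to talk to the local client (it may need to be run from the BOINC data directory or given the RPC password). The function name prepare_reboot and the DRY_RUN switch are made up for illustration.

```shell
# Pause all BOINC computation, wait for tasks to stop, then flush file buffers.
# $1 = seconds to wait (default 60). Set DRY_RUN=1 to only print the commands.
prepare_reboot() {
    wait_secs=${1:-60}
    for cmd in "boinccmd --set_run_mode never" "sleep $wait_secs" "sync"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}
```

Since a run mode set without a duration stays in force, "boinccmd --set_run_mode auto" would be needed after the reboot to resume crunching.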
ID: 63325 · Report as offensive
lazlo_vii

Joined: 11 Dec 19
Posts: 108
Credit: 3,012,142
RAC: 0
Message 63327 - Posted: 18 Jan 2021, 16:13:39 UTC - in response to Message 63325.  
Last modified: 18 Jan 2021, 16:15:06 UTC

Even if I run 8 I seem to rarely get crashes when I reboot.

I only get a crash when I reboot after an Ubuntu update (mainly of the Linux kernel). To prevent that, suspend the work unit before the reboot, and it should work.


I have tried suspending CPDN tasks before rebooting; however, I still had failures. I think it's because the disks I use to store all of the BOINC data (for all four of my computers, each of which runs at least two containers) are writing data almost constantly. All of my BOINC projects (60 threads of crunching) average about 180MB of data written per minute across a four disk RAID10 array. Since three of the systems use the storage across the network, that means I have to shut them down before I reboot the main server. If I suspend all of my BOINC projects on the main server and wait about five minutes before I reboot it's not quite so bad, but I still lose work sometimes. Also, sometimes I forget to stop everything before the reboot and that is almost certain to cause failures for CPDN. So long story short, rebooting my network is a 10 or 15 minute job instead of a 2 minute job when I run CPDN tasks. I don't want to have to go through unusual procedures just because I run CPDN.

I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off of the spinning disks and onto the SSDs but I really don't add write-heavy workloads to them.
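
For a non-permanent experiment, the standard ionice tool can change the I/O class and priority of processes that are already running. The sketch below only prints the commands for a list of PIDs. The helper name ionice_cmds is made up for illustration, and whether this helps at all is an open assumption: I/O priorities are honoured only by some I/O schedulers (BFQ, for example), and the setting is lost when a task restarts.

```shell
# Print ionice commands that move each given PID to the best-effort class (2)
# at its highest priority level (0). Nothing is changed until they are run.
ionice_cmds() {
    for pid in "$@"; do
        echo "ionice -c 2 -n 0 -p $pid"
    done
}
```

The PIDs could come from something like "pgrep -f hadam4", and the printed commands can be reviewed before being run by a user with sufficient privileges.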
ID: 63327 · Report as offensive
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 63329 - Posted: 18 Jan 2021, 18:06:06 UTC
Last modified: 18 Jan 2021, 18:10:47 UTC

GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed.

Edited for typo.
ID: 63329 · Report as offensive
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63330 - Posted: 18 Jan 2021, 18:21:44 UTC - in response to Message 63327.  

I wish I knew of an easy way to permanently raise the IO priority of CPDN tasks or the containers I am running them in. That might help me isolate the cause. Another solution might be to move CPDN's storage off of the spinning disks and onto the SSDs but I really don't add write-heavy workloads to them.

I am glad you asked. I learned early on in the CPDN game that you needed fast disk access. I was using spinning platters back then, but have since moved to SSDs. In either case, it helps to have a write-cache. In the case of spinning platters, it prevents errors. In the case of SSDs, it protects the SSDs from excessive writes.

For Windows, I used either PrimoCache or a ramdisk (Primo Ramdisk or Dataram ramdisk). The cache takes up less memory, and is easier to set up, so that was what I ended up with. But if you have lots of main memory, you can just put the entire BOINC data directory on a ramdisk and you are done.

For Ubuntu, I use the built-in caching system, just setting the memory size and latency (the time the writes are held in memory before being written to the disk) to much larger values.

This will work nicely, though smaller values are possible if you don't have so much memory.

Swappiness, to reduce the use of swap:
sudo sysctl vm.swappiness=0

Set write cache to 8 GB/8.5 GB (for 32 GB main memory):
sudo sysctl vm.dirty_background_bytes=8000000000
sudo sysctl vm.dirty_bytes=8500000000
sudo sysctl vm.dirty_writeback_centisecs=500     # checks the cache every 5 seconds
sudo sysctl vm.dirty_expire_centisecs=360000     # page flush after 60 min

The first value (8 GB) effectively sets the size of the cache: it is how much unwritten data can accumulate before the kernel starts writing it out in the background. The second value (8.5 GB) sets the maximum amount of pending writes allowed before the processes doing the writing are halted until the contents are written to the disk. You normally never see that in practice, since the SSDs are usually fast enough to keep up. You flush the cache every 60 minutes, though I often set it to 2 or 4 hours with still larger cache sizes. But you are not writing to the disk as much as the OS writes to the cache, since most values in scientific applications are overwritten many times as the results are updated. With these values on a couple of CPDN work units, I expect you will write less than 30% as much to disk as the work units write to the OS. And they are "easy" writes, since they are serialized. It is the random writes of the raw data that kill SSDs the fastest.

Here is the info: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
You can use much smaller values if you don't have so much memory. A 2 GB cache and 30 minute latency will extend the life of the SSD a lot.
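
Settings made with "sudo sysctl" are lost at reboot, so one option is to keep them in a file under /etc/sysctl.d/. The sketch below prints such a file for a chosen cache size and latency. The function name dirty_settings is made up for illustration, and sizes are taken in decimal megabytes so that 8000 gives the same 8000000000 used above.

```shell
# Print sysctl.d-style settings for a larger write cache.
# $1 = background threshold in MB, $2 = hard limit in MB, $3 = expiry in minutes.
# Sizes use decimal megabytes (1 MB = 1,000,000 bytes), matching the values above.
dirty_settings() {
    bg_mb=$1
    max_mb=$2
    expire_min=$3
    echo "vm.swappiness = 0"
    echo "vm.dirty_background_bytes = $((bg_mb * 1000000))"
    echo "vm.dirty_bytes = $((max_mb * 1000000))"
    # Wake the flusher every 5 seconds, as in the commands above.
    echo "vm.dirty_writeback_centisecs = 500"
    # Minutes -> seconds -> hundredths of a second.
    echo "vm.dirty_expire_centisecs = $((expire_min * 60 * 100))"
}
```

For the smaller setup, "dirty_settings 2000 2500 30" would give a 2 GB cache with 30 minute latency (the 2500 MB hard limit is an assumed value, not one from this thread). The output could be saved as, say, /etc/sysctl.d/90-boinc-writecache.conf (a hypothetical file name) and loaded with "sudo sysctl --system".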
ID: 63330 · Report as offensive
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63331 - Posted: 18 Jan 2021, 18:39:20 UTC

If I may add to this discussion on WU crashes and system reboots: I have two computers running the same version of Boinc. When I reboot (which I have to), I normally exit Boinc beforehand. One of the computers was crashing WUs. What was happening in my case was that, after exiting Boinc on this computer, the WUs were still running in Task Manager. Now, what will happen if you have exited Boinc up front but the WUs are still running and the system goes through a reboot? I have reinstalled Boinc but the problem is still there. The misbehaving computer is an Acer, while the Dell is cooperating. I have still not been able to solve this one. The version of Boinc is the current one.
ID: 63331 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63332 - Posted: 18 Jan 2021, 19:02:35 UTC - in response to Message 63331.  

If I may add to this discussion on WU crashes and system reboots: I have two computers running the same version of Boinc. When I reboot (which I have to), I normally exit Boinc beforehand. One of the computers was crashing WUs. What was happening in my case was that, after exiting Boinc on this computer, the WUs were still running in Task Manager. Now, what will happen if you have exited Boinc up front but the WUs are still running and the system goes through a reboot? I have reinstalled Boinc but the problem is still there. The misbehaving computer is an Acer, while the Dell is cooperating. I have still not been able to solve this one. The version of Boinc is the current one.

When you close Boinc, do you get asked "do you wish to stop tasks running"? If not, you may have ticked a "don't ask again" box in there. If you don't get the dialog, in Boinc Manager go to options menu, other options, general tab, enable manager exit dialog.

Also, make sure you have: Options menu, computing preferences, disk and memory tab, "leave non-GPU tasks in memory when suspended". This stops the climate tasks screwing up if Boinc pauses them to run another project.
ID: 63332 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63333 - Posted: 18 Jan 2021, 19:03:52 UTC - in response to Message 63329.  

GREAT NEWS! MORE WORK FOR WINDOWS. I just managed to download 12 of the new batch (batch 892) on 3 machines. Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed.

Edited for typo.
12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house, these things make a lot of heat!
ID: 63333 · Report as offensive
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63335 - Posted: 18 Jan 2021, 19:42:08 UTC

Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed.


Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2 degree change rather than 1.5.
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
ID: 63335 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63336 - Posted: 18 Jan 2021, 19:48:45 UTC - in response to Message 63335.  

Hopefully, these will be better than the last Windows batch. I downloaded 6 of those and every one of them crashed.
Both 892 and 893 are for the EU region rather than the SAFR region, which has proved problematic in the past, so I would expect them to be better behaved. Also, as George suggests, there should be one more batch for this lot, which will be looking at a 2 degree change rather than 1.5.

Any news on a reissue of the SAFR batch?
ID: 63336 · Report as offensive
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63337 - Posted: 18 Jan 2021, 20:05:06 UTC

Any news on a reissue of the SAFR batch?


No mention of it on the moderator forums. I suspect the first mention of it will be just before it goes out.
ID: 63337 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63338 - Posted: 18 Jan 2021, 20:07:34 UTC - in response to Message 63330.  

My machine does have a 512GByte SSD, but my BOINC stuff is on a 7200rpm spinning hard drive.
Seagate BarraCuda 2TB Internal Hard Drive
HDD - 3.5 Inch SATA 6Gb/s 7200 RPM 256MB Cache [in the drive]
I am running Red Hat Enterprise Linux release 8.2 (Ootpa)
with kernel 4.18.0-193.28.1.el8_2.x86_64

Now all the BOINC stuff is on drive /dev/sdb3, and nothing else is on that partition. The other partitions are not very busy now. This shows disk traffic at 5-minute intervals.
$ iostat -t -y --human -d /dev/sdb3 300
Linux 4.18.0-193.28.1.el8_2.x86_64 (localhost.localdomain) 	01/18/2021 	_x86_64_	(16 CPU)

01/18/2021 02:09:27 PM
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb3              3.36         0.0k         1.4M       4.0k     413.4M

01/18/2021 02:14:27 PM
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb3              2.68         0.0k       246.2k       0.0k      72.1M

01/18/2021 02:19:27 PM
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb3              4.71         0.0k         1.4M       0.0k     410.7M

01/18/2021 02:24:27 PM
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb3              6.94         0.0k         2.7M       0.0k     804.4M

01/18/2021 02:29:27 PM
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb3              4.02         0.0k         1.5M       0.0k     444.0M


This shows my BOINC workload; it seems to complete an N216 CPDN work unit in about a week.

top - 14:18:45 up 17 days, 20:36,  1 user,  load average: 8.42, 8.31, 8.30
Tasks: 474 total,   9 running, 465 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.2 sy, 49.9 ni, 47.6 id,  1.7 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem :  63943.9 total,   2656.3 free,   8170.4 used,  53117.2 buff/cache
MiB Swap:  15992.0 total,  15821.7 free,    170.2 used.  54877.1 avail Mem 

USER   PR  NI S    RES    SHR    %MEM  %CPU  P     TIME+ COMMAND                                                  
boinc  39  19 R   1.3g  19972     2.1  99.7  0   2666:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um+ 
boinc  39  19 R   1.3g  19808     2.1  99.7  3   4602:04 /var/lib/boinc/projects/climateprediction.net/hadam4_um+ 
boinc  39  19 R   1.3g  19852     2.1  99.7  6   3257:36 /var/lib/boinc/projects/climateprediction.net/hadam4_um+ 
boinc  39  19 R 761416  28688     1.2  99.7  7 214:52.58 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_w+ 
boinc  39  19 R 421532  76884     0.6  99.7  2 171:19.91 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_+ 
boinc  39  19 R 113348   2640     0.2  99.9  5  36:26.20 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+ 
boinc  39  19 R 106192   2632     0.2  99.7  1   9:26.41 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_a+ 
boinc  39  19 R  72996   2464     0.1  99.9  4  74:22.78 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_m+

ID: 63338 · Report as offensive
lazlo_vii

Joined: 11 Dec 19
Posts: 108
Credit: 3,012,142
RAC: 0
Message 63339 - Posted: 18 Jan 2021, 22:10:01 UTC - in response to Message 63338.  

Here is the iostat output for my RAID10 array:

$ iostat -tymd /dev/sd[e-h] 300
Linux 5.8.0-38-generic (bsquad-host-1) 	01/18/21 	_x86_64_	(32 CPU)


01/18/21 21:52:45
Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
sde              28.11         0.00         2.38         0.00          0        715          0
sdf              28.88         0.00         2.38         0.00          0        715          0
sdg              28.36         0.00         2.38         0.00          0        715          0
sdh              28.37         0.00         2.38         0.00          0        715          0


01/18/21 21:57:45
Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
sde              26.81         0.00         2.17         0.00          0        649          0
sdf              27.05         0.00         2.16         0.00          0        647          0
sdg              26.16         0.00         2.16         0.00          0        647          0
sdh              27.00         0.00         2.17         0.00          0        649          0


01/18/21 22:02:45
Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
sde              32.42         0.00         2.61         0.00          0        784          0
sdf              32.33         0.00         2.61         0.00          0        783          0
sdg              32.30         0.00         2.61         0.00          0        784          0
sdh              32.18         0.00         2.61         0.00          0        784          0


01/18/21 22:07:45
Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
sde              26.28         0.00         2.29         0.00          0        687          0
sdf              27.41         0.00         2.29         0.00          0        686          0
sdg              26.60         0.00         2.29         0.00          0        686          0
sdh              27.51         0.00         2.29         0.00          0        687          0


Right now the only active workload on these disks is BOINC WCG.
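
To turn per-disk iostat totals like these into a logical write rate for the array, the sketch below may be useful. It assumes a standard RAID10 layout in which every logical write lands on two mirrored disks, so the per-disk totals are summed and halved. The function name raid10_mb_per_min is made up for illustration.

```shell
# Estimate logical MB written per minute on a RAID10 array.
# $1 = sampling interval in seconds; remaining args = MB written per member disk.
raid10_mb_per_min() {
    interval=$1
    shift
    echo "$@" | awk -v secs="$interval" '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        # Each logical write is stored on two mirrored disks, so halve the sum.
        printf "%.1f\n", (total / 2) / (secs / 60)
    }'
}
```

For the first sample above, "raid10_mb_per_min 300 715 715 715 715" works out to 286.0 MB per minute under that mirroring assumption.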
ID: 63339 · Report as offensive
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 63340 - Posted: 18 Jan 2021, 23:11:22 UTC - in response to Message 63333.  
Last modified: 18 Jan 2021, 23:12:00 UTC

12? [giggle] Sorry. I have 89 on 7 computers. Global warming is occurring inside my house, these things make a lot of heat!

My machines are 3 laptops with only 4 cores (2 physical and 2 hyperthreaded) each. So 12 is a lot for me. Also I managed to snag 10 more before the supply ran out.
ID: 63340 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63341 - Posted: 18 Jan 2021, 23:45:47 UTC - in response to Message 63331.  

KAMasud

The last few versions of BOINC have had some features removed, to meet the requirements of some organisations that don't want people to be able to fiddle.
This may be why your tasks are still running when you exit.
You'll need to check what options you have in the menu, possibly under File.

As for cpdn models, they each have a lot of files open, which all need to be saved before shutdown.
If shutdown occurs in the middle of a model doing a save, then some of what is saved is "old", and some is "new", and the program can't restart that model.
ID: 63341 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63342 - Posted: 18 Jan 2021, 23:50:12 UTC - in response to Message 63340.  

Jim

You'll need to run faster if you want to keep up. :)
The 2nd batch is there now.
ID: 63342 · Report as offensive

©2024 cpdn.org