Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
I suppose that the largest group of CPDN contributors run CPDN alongside several other projects, which, for all practical purposes, puts them into the "bottlenecked by CPU" category.

Jean-David Beyer wrote:
> cache-misses # 49.552 % of all cache refs

(That's on a Xeon W2245 with 8c/16t, 16.5 MB last-level cache, mixed workload with 3x OpenIFS. I suspect that some Rosetta work can be quite cache hungry too. Not sure of any of the others. You could check by )

Sample from a dual-Epyc 7452 = 2x {32c/64t, 8x 16 MB last-level cache}, running merely 5 OpenIFS at the moment because I want to clear a backlog of uploads, plus 59 PrimeGrid llrSGS which have a known cache footprint of 1 MByte. That is, only 64 of 128 logical CPUs are used at a time. Also, it's a headless system; the display-manager service is shut down.

system-wide: ~10% cache misses
looking at one of the master.exe processes: ~11% cache misses
looking at one of the sllr processes: ~10% cache misses

It's remarkable that master.exe and sllr processes have the same cache miss rate. But of course it's only a very small and quick sample which I took for now. |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
Steven,

That's a nice summary of the different problems we're seeing. I'm going to print that out as it's better than my notes!

As Dave says, the memory corruption is the more serious one. This kills the controlling process (not the boinc client), which looks after running the model. The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot, corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.

There is also a problem at the very end of the task where the last upload file seems to go missing. I have seen this happen once now on my test machine. It's some kind of race condition between the multiple processes managing the task: one process doesn't get information when it should, but I haven't pinned down exactly what's going on.

My impression is that the tasks run better if there are only 1 or 2 running at a time. Mine are running fine; I've only had 1 failure so far this way.

I was looking at the batch statistics and the success/failure ratio has improved for the latest batch compared to the first batch. Perhaps that's in part because everyone has got better at managing the tasks after the first batch?

Apologies for the failures and your time, but the summary was very useful and maybe the moderators can refer others to that post. I have fixed the memory and disk bounds for these tasks and I've started looking at these other issues with the CPDN folk.

Regards, Glenn

Edit: and thanks for the posts about broadband & upload speeds. CPDN have a tool for computing workunit output sizes to advise scientists on what's acceptable. I'll pass that info on to them. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
> Memory 62.28 GB
> Cache 16896 KB
> Swap space 15.62 GB
> Total disk space 488.04 GB
> Free Disk Space 477.76 GB
> Measured floating point speed 6.13 billion ops/sec
> Measured integer speed 26.09 billion ops/sec
> Average upload rate 4480.76 KB/sec
> Average download rate 45235.53 KB/sec
>
> I think the data rates reported by Boinc-CPDN are really Kilobits per second, not KiloBytes per second.)

Where are you copying those figures from? I see these for your two machines on the CPDN website:

Average upload rate 4308.41 KB/sec (Xeon)
Average upload rate 136.81 KB/sec (Windows 10)

I get this on a 15.9 Mbps uplink fibre ADSL line:

Average upload rate 603.55 KB/sec

I think that's consistent with the BOINC measurement being based on bytes. Some figures may be skewed if there's a proxy server in the loop. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
> The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.

I think that's unlikely. The BOINC client is pretty robust about checking that a proposed slot is genuinely empty before starting a new task in it: if there's any doubt, it creates a new slot directory and starts the task there instead.

We did have a problem a few years ago where files over 4 GB (!) were invisible to the checking routine, but that's been long fixed. |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
> > The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.
>
> I think that's unlikely. The BOINC client is pretty robust about checking that a proposed slot is genuinely empty before starting a new task in it: if there's any doubt, it creates a new slot directory and starts the task there instead.

Quite prepared to accept that's the case. But looking at the task logs, there's some output in some of them that suggests the model is still running.

Possible scenario is: the wrapper dies, leaving the master.exe process still running. The boinc client detects the wrapper (i.e. the task) has died and clears out the slot directory. However, master.exe will write to the same slot every, let's say, 1 min, writing to a couple of text output log files. So the slot dir appears empty until the process does the write. In the meantime, the client has started another task, and then 30 secs later the first master.exe writes to the files (which now exist because the new model task has started). Possible? |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Just downloaded a resend of a work unit that failed due to an error: Task 22245903. It failed because it ran longer than 5 minutes after the work unit had finished. The WU was run by mikey and, other than the longer run time after finishing, seemed to have run successfully after over 2 days run time. The run time seems overly long on a Ryzen, but it did complete.

It is now running as Task 22249047 on my Ryzen computer. Will see how it runs for me.

Conan |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
> Possible scenario is: wrapper dies, leaving the master.exe process still running. boinc client detects the wrapper (i.e task) has died and clears out the slot directory. However, master.exe will write to the same slot every, let's say, 1 min, writing to a couple of text output log files. So the slot dir appears empty until the process does the write. In the meantime, the client has started another task, and then 30secs later the first master.exe then writes to the files (which now exist because the new model task has started). Possible?

Yes, I suppose so. If master.exe is still running in memory, but the slot directory has had all files deleted, then the BOINC client could nip in and start launching a new task in the split second before the next write (and the final 'empty folder' check is a single test just before the next task is launched - there's no ongoing verification).

I think the slots mostly contain symlinks to files stored in the project directory? Does anything check that those links are still valid? |
Joined: 29 Oct 17 Posts: 1052 Credit: 16,726,716 RAC: 12,672 |
> looking at one of the master.exe processes: ~11% cache misses

Should look at TLB misses, they are expensive. I'm not sure if that's what you mean by a 'cache miss'.

> Here is an idea: It seems as if watching the stderr.txt could be a useful tool to estimate progress rate of an OpenIFS task. I could use that to explore relative performance of different task distributions.

Find the slot directory for the task and find the 'ifs.stat' file; this is the file that the controlling wrapper process is watching.

% tail -f ifs.stat
11:24:28 0AAA00AAA STEPO 512 16.886 16.886 26.918 177:55 1298:49 0.11147937005926E-04 2GB 0MB
11:24:58 0AAA00AAA STEPO 513 17.573 17.573 29.619 178:12 1299:18 0.11141352198812E-04 2GB 0MB

The 4th column is the current model step. The 5th column is what you want: this is the CPU time of the last step. I optimize my setup by watching this number, aiming to get it as low as possible. Note that when the model is doing output, there are multiple lines per step.

For info, the full set of columns is (not all may work well outside ECMWF):
1 : wall-clock time
2 : model configuration (short code for exactly what the model is doing)
3 : name of calling routine
4 : timestep count
5 : CPU time of last step
6 : vector CPU time of last step (throwback to the old days when the model ran on vector hardware)
7 : wall-clock time of last step
8 : accumulated CPU time
9 : accumulated wall-clock time
10 : L2 norm of global divergence field (used to check bit-reproducibility)
11 & 12 : heap and stack memory; these don't work well outside of ECMWF. |
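For anyone who wants to track the 5th column over time rather than eyeball `tail -f`, here is a minimal sketch of a parser for the 12-column layout described in the post above. The class and function names are made up for illustration. It assumes every line of interest has exactly 12 whitespace-separated fields, as in the two sample lines; anything else is skipped. When a step appears on several lines (the post says this happens during output), the sketch keeps the last value seen for that step. That is a guess at a reasonable policy, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class IfsStatRecord:
    """One ifs.stat line, using the column meanings given in the post."""
    wall_clock: str          # column 1
    config: str              # column 2
    routine: str             # column 3
    step: int                # column 4
    cpu_last_step: float     # column 5
    vec_cpu_last_step: float # column 6
    wall_last_step: float    # column 7
    cpu_accum: str           # column 8, kept as text (e.g. "177:55")
    wall_accum: str          # column 9, kept as text
    divergence_norm: float   # column 10
    heap: str                # column 11
    stack: str               # column 12


def parse_ifs_stat_line(line: str) -> Optional[IfsStatRecord]:
    """Parse one line; return None if it does not match the 12-column layout."""
    f = line.split()
    if len(f) != 12:
        return None
    try:
        return IfsStatRecord(f[0], f[1], f[2], int(f[3]), float(f[4]),
                             float(f[5]), float(f[6]), f[7], f[8],
                             float(f[9]), f[10], f[11])
    except ValueError:
        return None


def mean_cpu_per_step(lines: Iterable[str]) -> Optional[float]:
    """Average CPU time per step, counting each step number once."""
    by_step = {}
    for line in lines:
        rec = parse_ifs_stat_line(line)
        if rec is not None:
            by_step[rec.step] = rec.cpu_last_step  # last line for a step wins
    if not by_step:
        return None
    return sum(by_step.values()) / len(by_step)


# The two sample lines quoted in the post.
SAMPLE = [
    "11:24:28 0AAA00AAA STEPO 512 16.886 16.886 26.918 177:55 1298:49 0.11147937005926E-04 2GB 0MB",
    "11:24:58 0AAA00AAA STEPO 513 17.573 17.573 29.619 178:12 1299:18 0.11141352198812E-04 2GB 0MB",
]
```

Calling `mean_cpu_per_step(open("ifs.stat"))` from inside the slot directory would give one number to compare between different task mixes.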
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
> cache-misses # 49.552 % of all cache refs

Right. Is this more helpful? (Taken at a different time from the previous one I showed, but similar work load.)

# perf stat -aB -d -d -e cache-references,cache-misses
Performance counter stats for 'system wide':
21,080,333,138 cache-references (36.36%)
10,457,921,236 cache-misses # 49.610 % of all cache refs (36.37%)
1,372,155,371,235 L1-dcache-loads (36.37%)
78,486,820,592 L1-dcache-load-misses # 5.72% of all L1-dcache accesses (36.37%)
4,986,357,186 LLC-loads (36.37%)
3,181,661,273 LLC-load-misses # 63.81% of all LL-cache accesses (36.36%)
<not supported> L1-icache-loads
5,060,458,674 L1-icache-load-misses (36.36%)
1,373,019,705,796 dTLB-loads (36.36%)
117,604,750 dTLB-load-misses # 0.01% of all dTLB cache accesses (36.36%)
158,451,773 iTLB-loads (36.36%)
29,511,730 iTLB-load-misses # 18.63% of all iTLB cache accesses (36.36%)
62.707952810 seconds time elapsed

I then suspended all Rosetta tasks and got this:

# perf stat -aB -d -d -e cache-references,cache-misses
Performance counter stats for 'system wide':
20,554,374,124 cache-references (36.36%)
10,415,226,289 cache-misses # 50.672 % of all cache refs (36.36%)
1,205,539,957,850 L1-dcache-loads (36.36%)
70,253,511,063 L1-dcache-load-misses # 5.83% of all L1-dcache accesses (36.37%)
4,768,042,362 LLC-loads (36.37%)
3,149,369,211 LLC-load-misses # 66.05% of all LL-cache accesses (36.36%)
<not supported> L1-icache-loads
4,194,578,638 L1-icache-load-misses (36.36%)
1,206,853,267,867 dTLB-loads (36.36%)
50,345,945 dTLB-load-misses # 0.00% of all dTLB cache accesses (36.36%)
58,005,275 iTLB-loads (36.36%)
20,350,600 iTLB-load-misses # 35.08% of all iTLB cache accesses (36.36%)
62.973239730 seconds time elapsed |
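The percentages perf prints are just ratios of the raw counters, so they can be recomputed when comparing runs. Below is a rough sketch that pulls "count event-name" pairs out of pasted `perf stat` text and derives miss ratios. The function names are invented for this example, and the regular expression only targets the line shape seen in the output above (a comma-grouped integer followed by an event name). Lines such as `<not supported> L1-icache-loads` or the elapsed-time line are skipped.

```python
import re

# A comma-grouped integer at the start of a line, then the event name.
_COUNTER_RE = re.compile(r"^\s*([\d,]+)\s+([\w-]+)", re.MULTILINE)


def parse_perf_counts(text: str) -> dict:
    """Return {event_name: count} for every counter line found in the text."""
    return {m.group(2): int(m.group(1).replace(",", ""))
            for m in _COUNTER_RE.finditer(text)}


def miss_ratio(counts: dict, misses: str, total: str) -> float:
    """Ratio of two counters, e.g. cache-misses / cache-references."""
    return counts[misses] / counts[total]


# A few lines copied from the first perf run in the post.
SAMPLE_PERF = """\
21,080,333,138 cache-references (36.36%)
10,457,921,236 cache-misses # 49.610 % of all cache refs (36.37%)
4,986,357,186 LLC-loads (36.37%)
3,181,661,273 LLC-load-misses # 63.81% of all LL-cache accesses (36.36%)
<not supported> L1-icache-loads
62.707952810 seconds time elapsed
"""

counts = parse_perf_counts(SAMPLE_PERF)
overall = miss_ratio(counts, "cache-misses", "cache-references")
llc = miss_ratio(counts, "LLC-load-misses", "LLC-loads")
```

Multiplying `overall` and `llc` by 100 should reproduce the percentages perf printed, to within rounding.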
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
> Where are you copying those figures from? I see these for your two machines on the CPDN website:

I do not think it is bytes. Consider the reported download rate of 25457 K. If that were bytes, it would be 203656 K bits per second, i.e. 203 Megabits per second. The most I could possibly get from my Internet connection (fibre-optic) is 75 Megabits per second.

Those numbers I got were from the "(Xeon)" machine, not the "(Windows 10)" machine. The Windows 10 machine is a pipsqueak and will not run the Oifs models anyway.

Right now those figures for my 1511241 machine are:

Average upload rate 4796.28 KB/sec
Average download rate 25457.84 KB/sec

These speeds seem to have gone up since I started getting Oifs work units. I do not know if this means the increased traffic to the CPDN servers caused this, or just that the Internet has sped up. In either case, I do not understand it. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
Does your Xeon have any IFS tasks left? If it does (or next time you have any, if not), could you:

1. Open BOINC Manager
2. Switch to 'Advanced' view (if not using it already)
3. Watch the 'Transfers' tab as the task progresses

You have to be quick - the recent batch had fairly consistent file sizes around 14 MB, and left my machine in around 10 or 11 seconds. The exact time can be checked in the Event log later. That would seem to imply a speed of around 1.2 MB/sec, or 1,200 KB/sec. The figures will flash up on the transfers tab while the transfer is active. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
> You have to be quick - the recent batch had fairly consistent file sizes around 14 MB, and left my machine in around 10 or 11 seconds. The exact time can be checked in the Event log later. That would seem to imply a speed of around 1.2 MB/sec, or 1,200 KB / sec. The figures will flash up on the transfers tab while the transfer is active.

My transfers seem to be about 5 seconds. Sometimes 4 seconds; sometimes 6 seconds. And any one task seems to send a trickle about every 8 minutes. But even if I leave the 'Transfers' tab displaying, they go by too fast for me to act.

HA! I tricked it! Watching the transfer tab, whenever it displayed anything, I hit the Print Screen button. One of them was 14.32 MB, 5826 KBps, ...

Now if only we knew the definition of B. 5.826 MBps. If it is bits, my 75 Megabits per second fibre-optic Internet can handle it with ease. If it is bytes, it can still handle it (46.6 Megabits per second). I conclude that I cannot conclude anything from these data. 8-(

Fri 02 Dec 2022 11:47:15 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip
Fri 02 Dec 2022 11:47:20 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip
Fri 02 Dec 2022 11:49:58 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip
Fri 02 Dec 2022 11:50:04 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip
Fri 02 Dec 2022 11:52:32 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_118.zip
Fri 02 Dec 2022 11:52:37 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_118.zip
Fri 02 Dec 2022 11:54:46 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_52.zip
Fri 02 Dec 2022 11:54:51 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_52.zip
Fri 02 Dec 2022 11:57:29 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_119.zip
Fri 02 Dec 2022 11:57:34 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_119.zip
Fri 02 Dec 2022 12:00:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip
Fri 02 Dec 2022 12:00:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip |
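Rather than catching transfers with Print Screen, the upload times can be read straight out of the Event log lines pasted above. This is a small sketch that pairs each "Started upload of" line with its "Finished upload of" line and reports the elapsed seconds. It assumes the exact line layout shown in the post (timestamp, project, message, separated by `|`) and that both timestamps are in the same time zone, so the zone abbreviation is ignored. The helper names are invented for this example.

```python
from datetime import datetime

MONTHS = {name: i + 1 for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

STARTED = "Started upload of "
FINISHED = "Finished upload of "


def parse_timestamp(stamp: str) -> datetime:
    """Parse e.g. 'Fri 02 Dec 2022 11:47:15 AM EST' (time zone is dropped)."""
    _weekday, day, month, year, clock, ampm, _tz = stamp.split()
    hour, minute, second = (int(x) for x in clock.split(":"))
    if ampm == "PM" and hour != 12:
        hour += 12
    elif ampm == "AM" and hour == 12:
        hour = 0
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def upload_durations(log_lines) -> dict:
    """Return {file_name: seconds} for each started/finished pair in the log."""
    started, durations = {}, {}
    for line in log_lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not an Event log line in the expected layout
        stamp, _project, message = parts
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = parse_timestamp(stamp)
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                delta = parse_timestamp(stamp) - started.pop(name)
                durations[name] = delta.total_seconds()
    return durations


# The first two uploads from the log in the post.
SAMPLE_LOG = [
    "Fri 02 Dec 2022 11:47:15 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip",
    "Fri 02 Dec 2022 11:47:20 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip",
    "Fri 02 Dec 2022 11:49:58 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip",
    "Fri 02 Dec 2022 11:50:04 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip",
]
```

Dividing a file's size by its duration then gives an average rate that does not depend on what the Transfers tab displays.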
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
The easiest way is to temporarily suspend network activity (BOINC Manager again, Activity menu) - this keeps the files on your disk so you can check them with the usual file system tools. The usual location is:

/var/lib/boinc-client/projects/climateprediction.net

but YMMV. You have the generic form of the file names from your log.

Remember to turn network activity back on when you've satisfied your curiosity! |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Do you mean like this?

-rw-r--r--. 1 boinc boinc 14962469 Dec 2 12:54 oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_60.zip
-rw-r--r--. 1 boinc boinc 14868279 Dec 2 12:55 oifs_43r3_ps_2039_2021050100_123_947_12165128_1_r1193171530_3.zip
-rw-r--r--. 1 boinc boinc 14849068 Dec 2 12:57 oifs_43r3_ps_1734_2021050100_123_946_12164823_2_r333244089_3.zip

So those files are almost 15 MegaBytes each, or 120 Megabits. Since they take an average of 5 seconds to send, that is about 24 Megabits/second, which will easily squeeze through my 75 Megabit/second fibre-optic Internet link. |
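The arithmetic in the post above, spelled out as a quick sketch. The file sizes are the byte counts from the directory listing and the 5-second duration is the average quoted in the thread. The function name and the decimal definition of "Mega" (1,000,000) are choices made for this example.

```python
def megabits_per_second(size_bytes: int, seconds: float) -> float:
    """Average transfer rate in decimal megabits per second."""
    return size_bytes * 8 / seconds / 1_000_000


# Byte sizes from the directory listing in the post.
SIZES = [14962469, 14868279, 14849068]
UPLOAD_SECONDS = 5.0   # average upload time quoted in the thread
LINK_MBPS = 75.0       # the poster's stated link speed

rates = [megabits_per_second(size, UPLOAD_SECONDS) for size in SIZES]
fits_link = all(rate < LINK_MBPS for rate in rates)
```

Each rate comes out just under 24 Mbit/s, matching the estimate in the post and sitting well within the 75 Mbit/s link.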
Joined: 1 Jan 07 Posts: 1061 Credit: 36,767,175 RAC: 3,168 |
Yup, I think we've reached the definitive answer. But it still doesn't explain your average download speed ... |
Joined: 29 Jan 06 Posts: 1 Credit: 607,579 RAC: 46,960 |
I crunch a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz) and it finish as valid after 2 days but the credit is 0 !? https://www.cpdn.org/result.php?resultid=22247911 |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
> I crunch a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz)

The credit script for CPDN is only run once a week, usually on Sundays, so wait a couple more days and credit should appear. |
Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
@Glenn: On my Ryzen at least, it looks as if there are few failures when only two tasks are running. With three or more running at a time, often one of them will crash. I will tail the file you mentioned a couple of posts back when I get the chance and look at that. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
> I crunch a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz)

You are not alone. Don't worry about it. I have completed 17 of these tasks successfully, no errors, on my main machine, 1511241, and also have no credits assigned yet. I think credits are updated only once a week on weekends. |
Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
xii5ku wrote:
> (That's on a Xeon W2245 with 8c/16t, 16.5 MB last-level cache, mixed workload with 3x OpenIFS. I suspect that some Rosetta work can be quite cache hungry too. Not sure of any of the others. You could check by )

I forgot to finish that sentence. It should have been something like "You could check by looking up a PID of a rosetta worker process and add -p PID to the perf command line". Although the processes which fight over cache affect each other's hit/miss rates, of course. A more conclusive way would be to investigate homogeneous workloads.

Glenn Carver wrote:
> Should look at TLB misses, they are expensive. I'm not sure if that's what you mean by a 'cache miss'.

perf's "-e cache-misses" appears to count across all cache levels; i.e. those accesses which couldn't be satisfied by any one or another of the cache levels. (There are other performance counters for distinct cache levels.)

Re: TLB: Good idea to watch these in general. In my particular current case of # cpu-time consuming processes = # physical cores, and because Linux doesn't move processes from core to core too frequently, it's not an issue, as TLB is a per-core resource, not shared between cores like e.g. L3$ or memory controllers. That is, each process has got an entire TLB for itself for most of its runtime, and I can't actually improve on that. TLB misses would be good to watch though if I ran more CPU time consuming processes than there are physical cores. Or if I was an OpenIFS developer, rather than just a user of the binary.

Glenn Carver wrote:
> Find the slot directory for the task and find the 'ifs.stat' file, this is the file that the controlling wrapper process is watching. [...] For info the rest of the columns are [...]

Thank you for this detailed info.

--------

About transfer speed display: IME the speeds shown at the show_host_detail webpage as well as those in boincmgr's transfers tab are not very accurate. However, as far as I can tell, the "B" is for Bytes, at both places.

--------

About OpenIFS failure modes: My current OpenIFS task count is 195 in progress (of those: 52 uploading), 93 valid, and 54 error, alas.

All of the error results come from only one out of three hosts. All three hosts have the same hardware, OS, boinc client configs, and same split workload of OpenIFS and PrimeGrid llrSGS. The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend-resume cycle. But I don't have a record of this to verify it.

The stderr.txts of these tasks are of two types: One type contains just "--". The other shows that the last one to five zip files were missing.

The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.

All three of the hosts which I have active at OpenIFS have plenty of RAM, and are set to "leave non-GPU tasks in memory while suspended".¹ That's possibly a factor why they run error-free.

¹) On a side note, my boinc clients would never suspend OpenIFS tasks on their own. I run OpenIFS and llrSGS in two separate boinc client instances, so that I have full control over work buffers and number of running tasks. (If I used the same client instance for both projects, the client could decide to suspend some OpenIFS in favor of more llrSGS.) But I triggered suspend-to-RAM of OpenIFSs once or twice now when I reduced the number of running OpenIFSs in order to cut down the upload backlog. |
©2024 cpdn.org