Message boards : Number crunching : UK Met Office HadAM4 at N144 resolution
Author | Message |
---|---|
Send message Joined: 27 Aug 04 Posts: 15 Credit: 1,163,443 RAC: 0 |
"Usually there's some clue, however subtle, in the message log when work should be sent but isn't." Hm, I doubt that, since the debug option isn't (or wasn't) activated by default, and as far as I know the debug messages are more informative than the usual BOINC Manager messages. As already said, the only message that appeared was "No work sent". If there had been anything else, I would have said so. Besides, I can't copy and paste anything here anymore: I turned the whole thing off and am posting from work now. And since there is no work anymore, it would be useless later anyway... Dammit, I would have liked to crunch again; it's been so long since I got CPDN work... It was far easier when they all ran on Windows... :-( Life is Science, and Science rules. To the universe and beyond Proud member of BOINC@Heidelberg |
Send message Joined: 27 Aug 04 Posts: 15 Credit: 1,163,443 RAC: 0 |
Well, I was quite surprised when I saw the new work this morning, and I finally got some WUs from the N144 ones. Thanks a lot! :-) Life is Science, and Science rules. To the universe and beyond Proud member of BOINC@Heidelberg |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Finally into the batch 848, which the researcher is keen to get run and returned. They're going to be about a fifth the run time of the N216, but the zips are about the same size as batch 842. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Finally into the batch 848, which the researcher is keen to get run and returned." I will do what I can to give them priority. But if we could select the projects, I could do a better job of it. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Would using HT with 848 benefit the output, or should I keep on using real cores only? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
Others who have actual experience will correct me if I am wrong, but I would guess that hyperthreading will increase total throughput on these. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Would using HT with 848 benefit the output, or should I keep on using real cores only?" Hmm. Good question. In general, hyper-threading helps, as Dave said. So I would try it. But on the larger ones (N216), it hurt. That was not because there was anything wrong with HT itself, but because you were running out of cache memory (on the CPU) with so many work units running at once. So in that case, it helped to limit the number running. I will try it both ways shortly. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"In general, hyper-threading helps as Dave said. So I would try it." I used to have a machine with two 32-bit 3.06 GHz Xeon processors that could be hyperthreaded, so it appeared as having 4 processors. I do not recall how much cache those two chips had. I used to run Seti@home, climateprediction, rosetta, and WCG. I tended to run three climatepredictions and one other. Now hyperthreading four processors (i.e., with hyperthreading turned on) would turn out more work than two, but not twice as much. So each task proceeded more slowly that way, but the total number of tasks per day was higher. My current (slow 1.800 GHz) processor has four 64-bit cores, but 10240K of cache. I cannot hyperthread them. I run Linux. I am currently running four N216 processes and they are getting 92%, 92%, 96%, and 97% of a processor. It is taking Average (sec/TS) 53.6570, but it runs so slowly that I do not wish to stop two of the processes to see if this would improve the cache hit ratio. It seems to take almost three weeks for me to do an N216 task, and the N144s ran faster: Average (sec/TS) 25.8696, taking me about two weeks. I suspect that since these processes are in a big loop, they are probably running the same code, so the instructions in the cache may only be in there once (once the program gets started, say after a few hours). So cache misses may be less of a problem than at first appears. This would not apply if one were running different applications (such as WCG, or even hadcm3s). Of course the data will be different, and that will increase the probability of a cache miss. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I suspect that we have to start the research again, but I'll see if I can find out anything from The Man himself. AFTER the weekend. In the meantime, I'm running 3 on each computer. The 4th on each machine is a N216, which I Suspended while downloading. I have a horrible feeling that I'm going to have to do a lot fiddling with the pref settings to fend off N216 downloads while I try and get some N244. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"I have a horrible feeling that I'm going to have to do a lot fiddling with the pref settings to fend off N216 downloads while I try and get some N244." I think we are out of N216 at the moment, but that is not a long-term solution. If we can't choose work units, then we will have to find some sort of compromise setting that works most of the time, whatever that is. I will be curious to see what happens when OpenIFS then comes along. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I have a horrible feeling that I'm going to have to do a lot fiddling with the pref settings to fend off N216 downloads while I try and get some N244." Not right now; I assume you mean N144. There seem to be no N216 tasks at the moment. Server status (application: unsent, in progress, runtime of last 100 tasks in hours as average (min - max), users in last 24 hours): UK Met Office HadAM4 at N144 resolution: 1808, 1640, 126.25 (64.58 - 329.88), 12; UK Met Office HadAM4 at N216 resolution: 0, 3862, 392.06 (128.25 - 655.58), 29. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
"I have a horrible feeling that I'm going to have to do a lot fiddling with the pref settings to fend off N216 downloads while I try and get some N244." As I type, the server status page is showing no unsent N216 tasks but 3842 of them out on various computers. Doubtless some of these will fail for the reasons we all know about, and as the maximum number of attempts has gone up to 5, some will reappear. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Jim1348 Apparently the N144 should have lower memory requirements than the N216, because of the lower resolution. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Since you asked, I can give you my results thus far. I am running the N144 on two identical Ryzen 2600 machines (Ubuntu 18.04.3). Machine 1 is running six cores (50% of the total): at 48% complete, it is estimating 3.15 days total. Machine 2 is running all 12 cores (100% of the total): at 19% complete, it is estimating 7.56 days total. So you are better off running with "full" cores (half the total), especially considering the reduced memory requirements will allow for more N216 and also (gasp!) OpenIFS. That will vary somewhat on different machines, but 50% of the total number of cores is a good guess for machines with hyper-threading. For machines with full cores, you could probably use more, but I still need to cut it down on my i7-9700, which has 8 full cores. YMMV. |
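The comparison in the post above can be checked with a little arithmetic. This is only a sketch: it assumes one N144 task per active thread and that the estimated totals (3.15 and 7.56 days) hold until the tasks finish. The function name is made up for this example.

```python
def tasks_per_day(concurrent_tasks: int, days_per_task: float) -> float:
    """Steady-state throughput: tasks finished per day across the whole machine."""
    return concurrent_tasks / days_per_task


# Figures quoted in the post for the two Ryzen 2600 machines.
half = tasks_per_day(6, 3.15)    # Machine 1: 6 of 12 threads in use
full = tasks_per_day(12, 7.56)   # Machine 2: all 12 threads in use

print(f"6 at once:  {half:.2f} tasks/day")
print(f"12 at once: {full:.2f} tasks/day")
print(f"half-load advantage: {half / full:.2f}x")
```

On these estimates the half-loaded machine finishes about 1.2 times as many tasks per day, which is consistent with the advice to use about 50% of the threads. The estimates were taken at 48% and 19% complete, so the real ratio may differ.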
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Apparently the N144 should have lower memory requirements than the N216, because of the lower resolution." They do. On my Linux 64-bit machine with 16 GBytes RAM, they each required about 4% of my RAM*; the N216 ones each require between 8.5% and 8.6% (1.3 GBytes). _____ * I just realized, I do not remember if I was running the N144 tasks with 8 GBytes RAM or 16, but if it was 8, the difference in size would be even greater. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
Ok, I will experiment on my i7-4790 at 75% or 6 cores. Currently I have two N216 and four N144, so I would not push it to 100%. I will monitor how sec/TS changes. On 4 cores only, N144 runs for 3d22h at 13 sec/TS, while N216 is ready in 12 days at 30-31 sec/TS. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Had my first failure of one of these big ones. hadam4_a1yz_201410_6_848_011925730_1 Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. So those starting parameters aren't viable. :( |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"Had my first failure of one of these big ones." I've had a couple of those from this batch. One right near the beginning, which also happened to another task in this work unit. The other was after two trickles. Still waiting to see if the wingman will crash at the same progress in that one. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
"Ok, I will experiment on my i7-4790 at 75% or 6 cores." So the two N216 run differently, as expected: the old one (4 real cores) still runs at around 30 sec/TS after 3 trickles, might drop for the 3rd; the new one (6 HT) runs at 39 sec/TS and will finish in 16.4 days (versus 12 on 4 cores). The four N144 also run differently, as expected: the two old ones started at 13 sec/TS and are now at 18, so 28% slower; the two new ones have been at 20 sec/TS from the start, so >35% slower. Not sure whether it is worth running HT. |
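The percentages in the post above match a simple speed-loss calculation, and the same numbers give a rough break-even check for hyper-threading. This is a sketch under assumptions: the function names are invented here, and the machine-wide comparison pretends all running tasks are N144, whereas the poster was running a mix of N144 and N216.

```python
def speed_loss(old_sec_per_ts: float, new_sec_per_ts: float) -> float:
    """Fraction of per-task speed lost when sec/TS rises from old to new."""
    return 1.0 - old_sec_per_ts / new_sec_per_ts


def machine_rate(tasks: int, sec_per_ts: float) -> float:
    """Timesteps per second summed over all tasks running at once."""
    return tasks / sec_per_ts


# Per-task slowdown, using the sec/TS figures quoted in the post.
print(f"N144 13 -> 18 sec/TS: {speed_loss(13, 18):.0%} slower")
print(f"N144 13 -> 20 sec/TS: {speed_loss(13, 20):.0%} slower")
print(f"N216 30 -> 39 sec/TS: {speed_loss(30, 39):.0%} slower")

# Hypothetical all-N144 load: 4 tasks on real cores vs 6 tasks with HT.
base = machine_rate(4, 13)
for sec_per_ts in (18, 20):
    change = machine_rate(6, sec_per_ts) / base - 1.0
    print(f"6 tasks at {sec_per_ts} sec/TS: {change:+.1%} total throughput vs 4 tasks")
```

With six tasks the total throughput lands within a few percent of the four-task figure in either direction, which fits the poster's doubt about whether HT is worth it on this machine.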
Send message Joined: 18 Feb 17 Posts: 81 Credit: 14,024,464 RAC: 5,225 |
Hello. A few questions regarding this very interesting topic. So I've got a Ryzen 1700x here that has just attached to the project and is currently only crunching one N144. From all I've gathered, I need 4 MB of L3 cache per WU, whether it is the N144 or N216, for more efficient runtimes? So I should create my app config to only allow for 4 at once on this machine, since it has 16 total. Or should I try for 8 of the real cores, since I'm also reading that simply using all real cores instead of hyperthreading helps. With that being said, am I able to let Rosetta and/or WCG use the remaining 8-12 threads without penalizing these workunits? The machine has lots of RAM available, 64 GB. I'm assuming N144 have shorter runtimes than the N216? This is my first time crunching these. How about an i7-4770, with only 8 MB total? Can I go slightly over the 4 MB each without huge repercussions, e.g. running 3 or 4 at once, letting Rosetta take over the other threads? Oddly enough, my 2600 Sandy Bridge seems to be handling itself well enough, with the CPU seeming to peg itself at 99+% per task, according to boinc tasks, running 8 at once. Unless this will end up dropping off steeply or I'm not reading correctly. All of these just got tasks today, so of course boinc and its wildly fluctuating estimations won't settle for a while. Still, interesting reading. How do the new WCG rainfall tasks use a CPU, and should I be limiting them, too? Not that there are many to grab. |
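One way to turn the 4 MB per task rule of thumb from the post above into a BOINC setting is sketched below. It is illustrative only: the 4 MB figure is the poster's reading of the thread and not a measured requirement, the helper names are invented, and the app name `hadam4` is a placeholder that would have to be replaced with the real application name from the project. The `<max_concurrent>` element inside `<app>` is the standard `app_config.xml` way to cap how many tasks of one application run at once.

```python
MB_PER_TASK = 4  # rule of thumb quoted in the thread, not a measured value


def max_tasks_by_cache(l3_mb: int, mb_per_task: int = MB_PER_TASK) -> int:
    """Number of tasks that fit in L3 at the assumed cache budget (at least 1)."""
    return max(1, l3_mb // mb_per_task)


def app_config_xml(app_name: str, max_concurrent: int) -> str:
    """Build a minimal app_config.xml that caps concurrent tasks for one app."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# L3 sizes as given in the post: 16 MB (Ryzen 1700X) and 8 MB (i7-4770).
for cpu, l3_mb in (("Ryzen 1700X", 16), ("i7-4770", 8)):
    print(f"{cpu}: up to {max_tasks_by_cache(l3_mb)} tasks by the cache rule")

# "hadam4" is a hypothetical app name used only for illustration.
print(app_config_xml("hadam4", max_tasks_by_cache(16)))
```

The generated file would go in the project's folder inside the BOINC data directory and be picked up after telling the client to re-read its config files. Threads left free by the cap remain available to other projects, though whether those compete for cache is a separate question that only measuring sec/TS on the machine can answer.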
©2024 cpdn.org