Message boards : Number crunching : Slow progress rate for HadAM4 at N216
Send message Joined: 20 Nov 18 Posts: 20 Credit: 816,342 RAC: 1,139 |
Hello, I decided to run the project on 32-bit Linux installed in a VM. I dedicated only one core of an i3-2100 and 4 GB of RAM to the VM. I got one HadAM4 at N216 but the progress rate is really slow. After nearly 6 hrs of running the project, I completed only 0.84%. Should I expect such a long running time for this task, or is it just my host? |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
That would be over 25 days per work unit, which is a bit slow for that CPU. I would expect it to run at least twice as fast. Are you running other projects? Normally VBox does not exact much of a penalty, but maybe it does not work well with N216. The caching requirements for N216 are a bit strange. |
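For anyone wanting to sanity-check figures like that, the estimate is just elapsed time divided by the fraction complete. A minimal illustrative sketch in Python (not from the posts themselves), using the numbers quoted above:

```python
# Rough ETA from BOINC progress figures: total time ~ elapsed / fraction done.
# Illustrative only; the inputs are the figures quoted in this thread.

def estimated_total_days(elapsed_hours: float, percent_done: float) -> float:
    """Extrapolate total runtime, assuming progress stays roughly linear."""
    fraction = percent_done / 100.0
    return elapsed_hours / fraction / 24.0

if __name__ == "__main__":
    # 0.84% after ~6 hours on the original poster's one-core i3-2100 VM
    print(f"{estimated_total_days(6.0, 0.84):.1f} days")  # ~29.8 days
```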
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I got one HadAM4 at N216 but the progress rate is really slow. After nearly 6hrs of running the project, I completed only 0.84%. Should I expect such long-running time for this task or is it just my host? My machine is a 64-bit 1.8 GHz 4-core Intel Xeon. Pretty fast when I got it long ago, but about 1/2 the speed of current machines. I am running an hadam4h N216 process in one core, an hadam4 N144 process in a second core, and two hadcm3s processes in the other two cores. The hadam4h process has 204 hours on it and is 38% done. The hadam4 process has 293 hours on it and is 74% done. One hadcm3s process has 226 hours on it and is 54% done. The other hadcm3s process has 223 hours on it and is 53% done. These figures are as reported by my (old 7.2.33) version of the boinc client. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hal, did you read the message from three weeks ago from the project co-ordinator about these? HadAM4 at N216 resolution |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Also worth noting that if your computer gets switched off before the first checkpoint, the task will start again from scratch. Looking at statistics for these tasks, machines without the necessary 32-bit libraries are probably still a bigger problem, but I would not be surprised if some tasks go for a very long time before being returned, if ever. Though I guess there would be a lot more of them were it not for the tasks crashing due to lack of those libraries. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I guess there would be a lot more of them were it not for the tasks crashing due to lack of those libraries. Are there a lot of these? Would not all work units, not just hadam4* work units, crash because of this? Is there any way for the boinc server for ClimatePrediction to detect whether libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries? |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
Are there a lot of these? Would not all work units, not just hadam4* work units, crash because of this? My theory is that the people who haven't installed the 32-bit libraries are the same ones that might not notice their tasks starting again from scratch each time they turn their computer on. The openifs tasks won't crash because they are 64-bit. Is there any way for the boinc server for ClimatePrediction to detect whether libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries? The project has been asked about this, and about the alternative of sending the libraries out with the tasks, but I guess there isn't an easy way to do it, because they haven't. There are other projects where the libraries are needed too, and none of them seem to have resolved this issue either. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Hello, What I would do is make sure BOINC preferences has "Leave non-GPU tasks in memory while suspended" checked, "Suspend when computer is in use" unchecked, and "Use at most xx% of CPU time" set to 100%. This should minimize interruptions. Even with those options, the time between checkpoints will be long (3-4 hrs?). With only 3 MB of L3 cache on that CPU, it's marginal for decent performance on this model. If that cache is being shared with other processes on the PC regularly, it might slow things down even more. |
Send message Joined: 20 Nov 18 Posts: 20 Credit: 816,342 RAC: 1,139 |
That would be over 25 days per work unit, which is a bit slow for that CPU. I would expect it to run at least twice as fast. Exactly. Nearly a month of continuous crunching, which I am not capable of doing at the moment. At the current speed it would take me more than that. Anyway, I installed a 32-bit Debian-based CLI version of Linux and I simply save the current state of the VM before turning off VirtualBox. It should not affect the progress of the currently processed task. Also I kept running 2 LHC@home tasks at the same time, which use VirtualBox. However I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task. What also struck me is that I crunched a few wah2 WUs on a Windows host with an Intel Celeron 2.16 GHz on board, and usually I needed around 7 days to complete those. And last thing, I am familiar with the announcement Les Bayliss mentioned in his post, but I did not see any clear indication of how long the task would run. Looking at the running times posted by Jean-David Beyer indicates that I might end up crunching for more than a month at the current speed. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Also I kept running 2 LHC@home tasks at the same time, which use VirtualBox. However I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task. Hummm. The LHC VBox tasks will take a lot of memory, at least CMS and ATLAS. When you suspend them, if you have "leave applications in memory" enabled, they will hang around in memory. So I wouldn't run them at all. I don't even try to run the native LHC tasks on my machines with a lot more memory and cache. Just exit LHC entirely and run CPDN for a while. Good luck with the switch out of VBox. I have never attempted such a thing. |
Send message Joined: 20 Nov 18 Posts: 20 Credit: 816,342 RAC: 1,139 |
Also I kept running 2 LHC@home tasks at the same time, which use VirtualBox. However I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task. I treat this as an experiment. Nothing more. I tried once to run CPDN on 64-bit Linux, but either it wasn't working at all or tasks were crashing unexpectedly after some time. I might try again in the future. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Looking at the running times posted by Jean-David Beyer indicates that I might end up crunching for more than a month at the current speed. Bear in mind my processor is an old GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7] that was pretty fast when I bought it, but runs at about 1/2 the speed of current machines. It does have a relatively large on-chip cache of 10240 KBytes, and 16 GBytes of RAM -- 8 modules of 2GB DMS Certified Memory DDR3-1333 (PC3-10600) 256x72 CL9 1.5v 240 Pin ECC Registered DIMM. My hadam4h is taking 52.3026 sec/TS, my two hadcm3s tasks are taking 22.7600 and 22.7597 sec/TS, and my hadam4 is taking 25.8856 sec/TS. The N216 model seems to be running twice as fast as the other two, but I am not sure I believe that. I figure two or three weeks apiece, but I have not been running these larger tasks for very long. I had not even been running any CPDN work units in a long time because I run Linux on this machine. I do not mind how long these work units take. In the past, I have run work units that had three phases to them and took several months apiece. |
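A note on comparing those sec/TS figures across models: different models and resolutions take different numbers of timesteps for the same amount of simulated time, so seconds per timestep alone does not say which task will finish first. A hedged sketch of the comparison, with the timestep counts as made-up placeholders rather than the real values for these batches:

```python
# Projected wall-clock time from seconds-per-timestep figures.
# The sec/TS values are the ones quoted in the post above; the total
# timestep counts are HYPOTHETICAL placeholders, since higher-resolution
# models (e.g. N216) take more, shorter timesteps than lower-resolution ones.

models = {
    # name: (seconds per timestep, assumed total timesteps - placeholders)
    "hadam4h (N216)": (52.3026, 50_000),
    "hadam4 (N144)":  (25.8856, 35_000),
    "hadcm3s":        (22.7600, 30_000),
}

for name, (sec_per_ts, total_ts) in models.items():
    hours = sec_per_ts * total_ts / 3600.0
    print(f"{name}: roughly {hours:.0f} hours of uninterrupted CPU time")
```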
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
I run 4 HadAM4h on my i7-4790 with 16GB RAM. They are all above 75%, have run >9 days, with an estimated 3 days remaining. The HDD write is around 145 GB for 20 hours. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I run 4 HadAM4h on my i7-4790 with 16GB RAM. I am seeing the same thing on my i7-4790. With four cores running on HadAM4h, I have 12+ days total (50% completed). It seems that four cores work the best on all my Intel machines (also i7-8700 and i7-9700), regardless of the total number of real or virtual cores. Ryzen is another matter, and I am still chasing that one down. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,708,504 RAC: 5,725 |
The HDD write is around 145 GB for 20 hours. I need to correct this to 14.5 GB (2x7400 MB) for 20 h, which is much better for the HDD. Checkpoints are at around 2.5 h; with no UPS, that is too long for my taste. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I run 4 HadAM4h on my i7-4790 with 16GB RAM. My processor is a 4-core 64-bit 1.8 GHz Xeon with 10240 KBytes cache, and 16 GBytes of RAM. I run one hadam4h currently getting 98.8% of a CPU: 153 hours to go, 234 hours run. I run two hadcm3s currently getting 98.1% of a CPU each: about 343 hours to go, 254 hours run. I run one hadam4 currently getting 97.6% of a CPU: 230 hours to go, 323 hours run. They all get a little more CPU time when I am not running Boinc Manager, the Firefox web browser, and a couple of little processes. |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
The HDD write is around 145 GB for 20 hours. I have been told the checkpoint interval will be shorter for these in the future, but I don't know by how much. |
Send message Joined: 2 Feb 05 Posts: 11 Credit: 983,334 RAC: 6,066 |
Jean-David Beyer wrote: Is there any way for the boinc server for ClimatePrediction to detect whether libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries? You probably don't want to do that: 1) Project-wise, this isn't a big problem. These tasks error out almost immediately, get sent back to the server, and are quickly turned around to go out to other hosts. This doesn't affect the project's overall throughput significantly, nor does it significantly impact the ability of good hosts to get work. 2) This is a problem users can, and do, fix. You don't want to block the host permanently. You don't even want to block it temporarily, because the inability to get tasks makes it impossible for a user to fix the problem. If you lock out such a host, you're actually contributing to the problem by making it harder for users to correct it! Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG. |
Send message Joined: 27 Apr 13 Posts: 4 Credit: 7,391,230 RAC: 2,474 |
I had two machines that didn't have the 32-bit libraries installed and they didn't get any work sent to them despite hitting the update button multiple times during the day. Once I installed the indicated libraries and rebooted the machine, I got work immediately after hitting the update button. Currently running 25 N216 units on one of the machines and they seem to be well behaved so far. Some are at 238 hours with 99 hours to go. Most should end under 400 hours. |
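For anyone unsure whether a 64-bit Debian/Ubuntu host already has the 32-bit runtime pieces these tasks depend on, here is a quick, purely illustrative check; the file paths below are common multiarch locations, not the project's official list of required libraries:

```python
# Quick check for 32-bit runtime pieces a 64-bit Debian/Ubuntu box commonly
# needs before it can run 32-bit applications.  The list is illustrative,
# not the project's official requirements.
import os

candidates = [
    "/lib/ld-linux.so.2",                      # 32-bit dynamic loader
    "/lib/i386-linux-gnu/libc.so.6",           # 32-bit C library
    "/usr/lib/i386-linux-gnu/libstdc++.so.6",  # 32-bit C++ runtime
]

for path in candidates:
    status = "found" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```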
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
1) Project-wise, this isn't a big problem. These tasks error out almost immediately, get sent back to the server, and are quickly turned around to go out to other hosts. This doesn't affect the project's overall throughput significantly, nor does it significantly impact the ability of good hosts to get work. Not sure I agree. Batch 843 already has 11% of its tasks down as hard failures, i.e. all three attempts to complete the work unit have failed. If nothing else, this means that 10% or more extra work units need to be generated and sent out in order to get sufficient results back. I don't know whether it would be worth giving the tasks dependent on these libraries four or five attempts before they are designated hard failures, compared to the normal three. It would mean that those that failed for other reasons, possibly after many days of computing, would tie up more computers. 2) This is a problem users can, and do, fix. You don't want to block the host permanently. You don't even want to block it temporarily, because the inability to get tasks makes it impossible for a user to fix the problem. If you lock out such a host, you're actually contributing to the problem by making it harder for users to correct it! In the past, machines have been blocked and messages sent to the users; once they confirm they have installed the missing libs, the machines are reset so they can get work again. |
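The 10%-or-more figure follows from simple resend arithmetic. A minimal sketch, assuming hard failures are independent of the particular work unit (an assumption, not something stated by the project):

```python
# How many work units must be generated to get a target number of good
# results back, given a hard-failure rate (all attempts failed)?
# Assumes failures are independent of the work unit itself - a simplification.
import math

def work_units_needed(target_results: int, hard_failure_rate: float) -> int:
    """Return the approximate number of work units to generate."""
    return math.ceil(target_results / (1.0 - hard_failure_rate))

if __name__ == "__main__":
    # Batch 843 figure quoted above: 11% hard failures
    print(work_units_needed(1000, 0.11))  # ~1124 work units for 1000 results
```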