Message boards : Number crunching : Strange counter on workunit
Author | Message |
---|---|
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
Can somebody please explain to me why the CPU time counter on the workunit suddenly jumped down to a much lower number? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7969311 It went from 4,477,594 seconds to 249,438 seconds, but the timestamp kept incrementing. (This is not my computer, by the way.) |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Not sure if it's related, but the owner of that computer is still running BOINC version 4.19. All current climate models require version 5 of BOINC to work properly. Please have them upgrade immediately or stop trying to crunch climate models. They're just wasting them. Look at the number of models for this computer and this one. |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
That model has at least completed, which is more than most on that machine. There have been a number of reports of CPU time anomalies over the years - all on Linux machines, I think. No-one has got to the bottom of it as far as I know. The problem is benign: the models complete successfully, but with saw-tooth CPU time values and sec/TS. As Les says, upgrading to at least BOINC version 5 is a must. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
If the owner of these computers doesn't know a lot about BOINC, it would probably be better for them to upgrade to version 5.10.45 than to BOINC version 6. (On the BOINC download page click on 'All versions'.) I think something may have gone wrong with the processing of that HADAM about 5 trickles before the end. The sec/timestep suddenly went down to a much lower number, as if some or all of the data wasn't being processed. This didn't happen on the other computers running the same workunit. A BOINC upgrade will probably fix all this. No current CPDN models are compatible with BOINC version 4. We've posted about new versions of BOINC and the models in the News thread, which is at the top of Number Crunching. Cpdn news |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
Well, at least the time series on all three computers look the same: But there are slight variations. How much variation is normal? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
How much variation is normal? Only the researchers will know that. The data displayed is only a tiny part of the results, and is only 'eye candy' for the crunchers. The bulk of the info is analysed by statistics programs to check its validity. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
One thing I've never understood about these HADAM graphs is why they only seem to show results for 11 months although the models run for a whole year. I'd already looked at the model's graphs to see whether there was any abnormality near the end, but none of the models in the WU (or any other HADAMs as far as I know) have the lines extending to the end of March. Cpdn news |
Send message Joined: 8 Nov 06 Posts: 18 Credit: 2,425,895 RAC: 0 |
The graph for 7969311 is slightly different from the other 2, which are identical. Tip: if you load only the graphs into 3 different tabs and flick between the tabs, you quickly see the differences. Works on Firefox; I assume it works for other browsers too? Dave |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Dave. The researchers expect each model in a workunit to produce slightly different results. The longer the model, the more likely it is that differences will develop. It's because of the Lorenz 'butterfly' effect. The slightest difference in computation affects everything that happens after that. We know, for example, that AMDs and Intels handle the calculations in different ways. So the research team treats the results from each model in a WU as unique and uses all the results unless they're eliminated by quality control. So I wasn't looking for minor differences. I wondered whether the graphs might show missing parts, or the graph line dropping down off the scale, or flatlining. For example, when HADSMs turn into 'iceworlds', in the few cases where they get far enough to produce a graph for the phase, this is the sort of thing we see. This HADAM discussed in a different thread turned into a looper and has now aborted on two computers. The graphs there show the sort of wacky (non-)results that mean something went seriously wrong. Cpdn news |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
As Dave says, 7969313 and 7969315 are the same. This is because they were both run on Windows/Intel. The third result, 7969311, is different because it is Linux/Intel. A computer is a 'machine' and, other things being equal, will produce exactly the same results every time. I don't buy this butterfly stuff! There are a number of reasons why results from the same work unit differ: 1. They are run by different 'machines', which are not expected to produce the same results, i.e. the theoretical algorithms are not in practice identical, because of differences in low-level calculations between operating system plus processor combinations. 2. Where the 'machines' are apparently the same but the results differ, the likely cause is: 2a) An inadequate definition of what a 'machine' is: operating system plus processor manufacturer may not be a complete enumeration of all machine components that vary. A machine will not then agree with other machines, but it would agree with itself. 2b) The state is corrupted during the run. That corruption amounts to a difference in the algorithms the machines apply: they are no longer the same. There are lots of reasons for this: a faulty PC (failing or overclocked), a programming error related to checkpointing (incomplete state saved, or changed state reloaded), corruption of the installation, and wacky things like cosmic rays or power glitches (causing single-event upsets, rewinds etc.). To produce the 'reference' run for your machine, you could: - put the PC at the bottom of a mine - add ECC RAM - not overclock aggressively - keep the installation folders protected - run the models as quickly as possible - run a diagnostics/stress test, even on a healthy-seeming machine. The project has published an analysis of the variations, and they're content that the variations behave as if they were butterfly flaps. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
So my laptop finally returned the result. It is different from the other three, but still fairly similar. It's an AMD Turion X2 machine. |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
[NewtonianRefractor wrote:] So my laptop finally returned the result. It is different from the other three, but still fairly similar. It's an AMD Turion X2 machine. Well done: |
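
The two views argued in this thread (a machine reproduces its own results exactly, yet results from different platforms drift apart) can be illustrated together with a toy example. What follows is a minimal sketch, not CPDN model code: it integrates the textbook Lorenz-63 system with a fourth-order Runge-Kutta step in plain Python. The parameter values, the step size, the starting point, and the `1e-12` nudge (an assumed stand-in for a rounding difference between operating system plus processor combinations) are all illustrative assumptions, and every function name is hypothetical.

```python
def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 toy system by one fourth-order Runge-Kutta step."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shift(s, k, h):
        # Move state s a distance h along the slope k.
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shift(state, k1, dt / 2))
    k3 = deriv(shift(state, k2, dt / 2))
    k4 = deriv(shift(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def distance(a, b):
    """Euclidean distance between two states."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def gap_history(a, b, steps):
    """Step two states side by side and record their separation after each step."""
    gaps = []
    for _ in range(steps):
        a, b = lorenz_step(a), lorenz_step(b)
        gaps.append(distance(a, b))
    return gaps


start = (1.0, 1.0, 1.0)
nudged = (1.0 + 1e-12, 1.0, 1.0)  # assumed stand-in for a platform rounding difference

identical = gap_history(start, start, 5000)   # same "machine", same input
perturbed = gap_history(start, nudged, 5000)  # tiny difference in one value

print("same input, largest gap:    ", max(identical))
print("nudged input, gap at step 100:", perturbed[99])
print("nudged input, largest gap:  ", max(perturbed))
```

In this sketch the repeated run with identical input never separates, which matches the point that a machine agrees with itself. The nudged run stays very close at first and later ends up on a completely different part of the trajectory, which is the kind of sensitivity the 'butterfly' explanation refers to. A real climate model is far more complex, so this only shows the mechanism, not the size of the differences seen in the HADAM graphs.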