Thread 'HadCM3 160-year runtime estimate wrong?'

Richard Haselgrove

Message 32090 - Posted: 9 Jan 2008, 10:45:27 UTC

I've just downloaded two new HadCM3 models. I got 160-year models (7177229, 7177246), replacing 80-year models which have just finished on both machines.

Yet both machines are telling me that the models will take about 48 days, the same as the 80-year models: that estimate is about right for 80 years at 2.0/2.1 s/TS, but it should be 96 days for 160 years. BOINC is v5.10.13 under Windows XP.
transient

Message 32095 - Posted: 9 Jan 2008, 12:15:23 UTC - in response to Message 32090.  
Last modified: 9 Jan 2008, 12:16:45 UTC

I also got a 160-year model, and it also displays about the same time to complete as the 80-year models I crunched previously.
The deadline is sufficiently far away, so it is not much of a problem for the BOINC manager, I think. Yes, I do know deadlines don't matter on CPDN, but the manager doesn't know that. ;)
Les Bayliss
Volunteer moderator

Message 32101 - Posted: 9 Jan 2008, 18:51:15 UTC


Last I heard, the deadline for the 160-year models was going to be made just under 3 years. (There's a BOINC problem if it's made longer than 3 years.)
Best to leave it for a year before worrying.

Richard Haselgrove

Message 32103 - Posted: 9 Jan 2008, 19:02:17 UTC - in response to Message 32101.  


The deadline for this morning's models is 10 July 2010 (2.5 years away), which sounds about right. It's the 'To completion' estimate which seems to have gone wrong, either as a consequence or otherwise.
Les Bayliss
Volunteer moderator

Message 32104 - Posted: 9 Jan 2008, 19:28:04 UTC


Ahhh! I just thought of something.

If there is only one "result duration correction factor" (RDCF) for ALL TCMs, then it may be set for the 80-year models. If so, then BOINC is in for a rude shock. And you should prepare for lots of error messages as you near the current "estimate time".

I'll ask about it.

Les Bayliss
Volunteer moderator

Message 32105 - Posted: 9 Jan 2008, 21:56:59 UTC


Ah HAAA!

Early reply:
There's only one RDCF per host per project, no matter what kinds of applications they run on that host.

Also, the RDCF can be edited, BUT ...

If, like me, people run several different types of models on one computer (in my case, a quad core), this is not a solution. Luckily, I'm running one HADAM3 and three HADSM3s, which have similar completion times, so BOINC won't get too flustered.

At the moment, all the advice that I can offer is Don't Panic (in big friendly letters).

No doubt more later.

Richard Haselgrove

Message 32106 - Posted: 9 Jan 2008, 22:52:07 UTC

Then, with apologies, we have to get technical:

Here are two snippets from client_state.xml:

<workunit>
    <name>hadcm3iozn_cpcx_2000_80_135898634</name>
    <app_name>hadcm3i</app_name>
    <version_num>544</version_num>
    <rsc_fpops_est>7000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>21000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>134217728.000000</rsc_memory_bound>
    <rsc_disk_bound>629145600.000000</rsc_disk_bound>
.....


<workunit>
    <name>hadcm3istd_00uh_1920_160_15922618</name>
    <app_name>hadcm3i</app_name>
    <version_num>544</version_num>
    <rsc_fpops_est>7000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>21000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>134217728.000000</rsc_memory_bound>
    <rsc_disk_bound>629145600.000000</rsc_disk_bound>
.....


Those are the two active tasks on host 788869. You can see from the names that the first is an 80-year model, and the second is a 160-year model. Yet they both have the same <rsc_fpops_est> (which is what BOINC uses, along with RDCF and benchmarks, to calculate 'time to completion'), and, more alarmingly, they have the same <rsc_fpops_bound> (which could, in extreme cases, cause the tasks to fail with a 'maximum CPU time exceeded', or whatever the wording is).

Something needs a bit of a tweak in the workunit generation department.
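
As a rough illustration of why identical fields give identical estimates, here is a minimal Python sketch. It assumes the estimate is simply the fpops estimate divided by the benchmark speed and scaled by the RDCF, and that the CPU limit is the fpops bound divided by the same speed. Both formulas, the function names, and the benchmark figure are assumptions made for illustration (the speed is back-calculated so that an 80-year model comes out at the 48 days reported above); none of it is taken from the BOINC source.

```python
# Illustrative only: the formulas and the benchmark figure are assumptions,
# not taken from the BOINC source or from any real host.
SECONDS_PER_DAY = 86_400


def estimated_days(rsc_fpops_est, benchmark_flops, rdcf=1.0):
    """First-pass runtime estimate in days: fpops estimate / speed, scaled by RDCF."""
    return rsc_fpops_est / benchmark_flops * rdcf / SECONDS_PER_DAY


def cpu_limit_days(rsc_fpops_bound, benchmark_flops):
    """CPU time, in days, at which the fpops bound would be reached."""
    return rsc_fpops_bound / benchmark_flops / SECONDS_PER_DAY


# Hypothetical effective speed, chosen so that an 80-year model comes out at 48 days.
flops = 7e15 / (48 * SECONDS_PER_DAY)

as_issued = estimated_days(7e15, flops)   # 160-year model carrying the 80-year value
doubled = estimated_days(14e15, flops)    # 160-year model with the value doubled
limit = cpu_limit_days(21e15, flops)      # bound shared with the 80-year model

print(f"estimate as issued:  {as_issued:.0f} days")
print(f"estimate doubled:    {doubled:.0f} days")
print(f"CPU limit as issued: {limit:.0f} days")
```

Whatever the real formula is, the arithmetic on the fields themselves is fixed: a bound of 21000000000000000 is three times the 80-year estimate of 7000000000000000, but only 1.5 times what a 160-year estimate ought to be, so the safety margin is halved.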
Les Bayliss
Volunteer moderator

Message 32109 - Posted: 9 Jan 2008, 23:19:05 UTC


I've sent a PM to the project people to get them across the situation.

old_user3

Message 32130 - Posted: 11 Jan 2008, 12:15:36 UTC - in response to Message 32109.  

You're absolutely right. The initial duration estimate should be double for the 160 WU.
I apologize for this mistake on our part.
I've corrected this. You can edit the client state file to reflect the correct values below
(i.e., just for the 160 WU).

80 WU
<rsc_fpops_est> "7000000000000000" </rsc_fpops_est>
<rsc_fpops_bound> "21000000000000000" </rsc_fpops_bound>

160 WU
<rsc_fpops_est> "14000000000000000" </rsc_fpops_est>
<rsc_fpops_bound> "42000000000000000" </rsc_fpops_bound>

Sorry once again and thanks for pointing this out.
Richard Haselgrove

Message 32131 - Posted: 11 Jan 2008, 12:22:29 UTC - in response to Message 32130.  
Last modified: 11 Jan 2008, 12:31:10 UTC

Thanks, Tolu.

Memo to other users: stop BOINC completely, and take an extra backup for luck, before attempting to edit client_state.xml.

[edit - and keep the formatting as in my message 32106 - i.e., no quotation marks, and six decimal places.]
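
For anyone who would rather script the edit than do it by hand, the following is a minimal Python sketch of the same steps: take a backup, then rewrite the two fpops fields for 160-year workunits only, in the plain six-decimal format. The function names, the `.bak` backup name, and the rule for spotting a 160-year workunit (a name of the form `hadcm3..._160_<number>`, guessed from the example name in message 32106) are all assumptions for illustration. BOINC must still be stopped completely before running anything like this.

```python
import re
import shutil
from pathlib import Path

# Corrected values from message 32130, written in the format of message 32106
# (no quotation marks, six decimal places).
NEW_VALUES = {
    "rsc_fpops_est": "14000000000000000.000000",
    "rsc_fpops_bound": "42000000000000000.000000",
}

WORKUNIT_RE = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
# Assumption: 160-year HadCM3 workunits are named hadcm3..._160_<number>.
NAME_160_RE = re.compile(r"<name>hadcm3\w*_160_\d+</name>")


def fix_160_year_workunits(state_text: str) -> str:
    """Return state_text with both fpops fields corrected for 160-year workunits only."""

    def fix_block(match: re.Match) -> str:
        block = match.group(0)
        if not NAME_160_RE.search(block):
            return block  # leave 80-year and all other workunits untouched
        for tag, value in NEW_VALUES.items():
            block = re.sub(rf"(<{tag}>)[^<]*(</{tag}>)", rf"\g<1>{value}\g<2>", block)
        return block

    return WORKUNIT_RE.sub(fix_block, state_text)


def fix_file(path: Path) -> None:
    """Back up the state file next to itself, then rewrite it in place."""
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # hypothetical backup name
    path.write_text(fix_160_year_workunits(path.read_text()))
```

A call such as `fix_file(Path("client_state.xml"))`, run from the BOINC data directory, would apply it. Working on the text with regular expressions rather than re-serialising the XML is a deliberate choice here, so that everything outside the two fields stays byte-for-byte as BOINC wrote it.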
old_user428438

Message 32132 - Posted: 11 Jan 2008, 14:23:51 UTC - in response to Message 32131.  

Good spot, Richard. An interesting sideline to this one is that, for the past 12 hours (i.e. before I edited the .xml), the "Time to complete" in BOINC Manager for my 160-year model had been slowly rising. After 36 hours of crunching it was showing about 706 hours to complete; after 46 hours (just before the edit) it was showing 710 hours to complete. Following the edit (still 46 hours in) it is showing 1376 hours to complete, and reducing as expected.

Perhaps there is a mechanism that would have prevented the Armageddon that Les was warning of?

F.
Iain Inglis

Message 32134 - Posted: 11 Jan 2008, 18:21:13 UTC

Done mine: it looks a lot more sensible now.

Note that there is a group of bounds for each model, and it's only the two fpops values (<rsc_fpops_est> and <rsc_fpops_bound>) that need changing, and only for the 160-year HadCM3 models.

Thanks, all.
mo.v
Volunteer moderator

Message 32138 - Posted: 11 Jan 2008, 19:55:02 UTC
Last modified: 11 Jan 2008, 20:05:10 UTC

There's a click-by-click explanation of how to edit numbers in the xml file in this post:

http://www.climateprediction.net/board/viewtopic.php?t=7215

Don't bother reading the explanation at the beginning of that post if you don't want to. It was in any case written for members moving their model to a faster computer. But the editing method is the same.

Go straight to the 'How to fix it' section of the post. If you find that the values for your model are already what Tolu says they should be (message 32130, four posts above this), you don't need to edit anything.
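
To see whether any editing is needed at all, a read-only check along these lines could be used. It is only a sketch: the function name is hypothetical, and it assumes, as a guess from the workunit names quoted earlier in the thread, that 160-year HadCM3 workunits are named `hadcm3..._160_<number>`. It changes nothing on disk.

```python
import re

# Values Tolu gives for the 160-year workunits in message 32130.
EXPECTED_160 = {"rsc_fpops_est": 14e15, "rsc_fpops_bound": 42e15}


def workunits_needing_edit(state_text: str) -> list[str]:
    """Names of 160-year HadCM3 workunits whose fpops fields are not yet corrected."""
    stale = []
    for block in re.findall(r"<workunit>.*?</workunit>", state_text, re.DOTALL):
        name = re.search(r"<name>(.*?)</name>", block)
        # Assumption: 160-year workunits are named hadcm3..._160_<number>.
        if not name or not re.fullmatch(r"hadcm3\w*_160_\d+", name.group(1)):
            continue
        for tag, expected in EXPECTED_160.items():
            field = re.search(rf'<{tag}>\s*"?([\d.]+)"?\s*</{tag}>', block)
            if field is None or float(field.group(1)) != expected:
                stale.append(name.group(1))
                break
    return stale
```

Passing it the contents of client_state.xml, for example `workunits_needing_edit(open("client_state.xml").read())`, returns an empty list when every 160-year model already carries the corrected values.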