climateprediction.net (CPDN) home page
Thread 'Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion'

Message boards : Number crunching : Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion

old_user186450

Joined: 11 May 06
Posts: 4
Credit: 1,008,514
RAC: 0
Message 36143 - Posted: 14 Feb 2009, 18:34:17 UTC - in response to Message 36140.  

... Unless you have a backup prior to the 10th, I'd suggest you abort it.
Good luck with the next model.

Thanks for that. It's dead.

I'll take a look at 'backups', although I always believed all the maintenance stuff should really be handled automatically by the BOINC wrapper.

OT slightly: is there an error FAQ somewhere? It seems that another of my models (7723388), after appearing to be fine, was marked invalid when it uploaded today :(
I'd like to get to the bottom of it, esp. if I still have a hidden hardware issue.


ID: 36143
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 36144 - Posted: 14 Feb 2009, 19:57:16 UTC - in response to Message 36143.  
Last modified: 14 Feb 2009, 19:57:59 UTC

OT slightly - Is there an error FAQ somewhere? It seems that another of my models (7723388) after appearing to be fine has been marked invalid when it uploaded today :(

All the trickles are there, as well as the total model graph. But the stderr output listing shows some problem with the 3rd zip upload. Not sure what to make of it.
Unable to resolve filename cpdnout3.zip

</stderr_txt>
<message>
<file_xfer_error>
  <file_name>hadsm3fub_k85l_005975356_1_3.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

I'd like to get to the bottom of it - esp. if I still have a hidden hardware issue.

If it's a potential hardware issue, I'd just run the latest version of Prime95 on all 4 cores for a day or two. I typically run the blend test since it's more memory-intensive, and CPDN is certainly memory-dependent.
ID: 36144
old_user186450

Joined: 11 May 06
Posts: 4
Credit: 1,008,514
RAC: 0
Message 36193 - Posted: 23 Feb 2009, 9:07:15 UTC - in response to Message 36144.  

OK - Think I know what happened.
I used the local preferences dialog to limit the number of CPU cores to 3. In the Windows version of BOINC, this dialog resets many fields to defaults, including those on other tabs, and it had limited the disk space available to BOINC to 2 GiB. Hence there was no space available to create or retain the file. It turns out I was right on the 2 GiB usage border, as other projects also take a bite out of this space. I have since raised the limit to 20 GiB to prevent this happening again.
ID: 36193
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36194 - Posted: 23 Feb 2009, 16:05:48 UTC
Last modified: 23 Feb 2009, 16:06:29 UTC

Hi David

A couple of your models crashed with -107 errors, e.g. this one. This could perhaps have been caused by a graphics problem completely unrelated to anything else you've mentioned. Have a look in the project READMEs (link in my sig) at the collection about crashes and problems, in particular item #6 by Mike and item #7 by Thyme Lawn.
Cpdn news
ID: 36194
old_user186450

Joined: 11 May 06
Posts: 4
Credit: 1,008,514
RAC: 0
Message 36225 - Posted: 26 Feb 2009, 13:37:24 UTC - in response to Message 36194.  

A couple of your models crashed with -107 errors eg this one.


Hi,

Almost 100% certain that those were caused by dodgy memory: I swapped out 4 sticks of Ballistix non-ECC for 4 sticks of ECC around 21st Jan 09. That WU crashed on 9th Dec 08. It's a pity that there was no visible alert that something was wrong, else I'd have taken corrective action sooner. Any crashes before that point I'd like to either run again or mark up somehow with an explanation, but we don't have those facilities.

FWIW, a second machine (HTPC) also had the same issues, but with a completely different graphics card. It too now has ECC RAM and has settled down.

One last thought - I also run the SETI cuda apps on this PC - could they interfere?
ID: 36225
old_user498662

Joined: 29 Jan 08
Posts: 1
Credit: 1,126,397
RAC: 0
Message 36260 - Posted: 1 Mar 2009, 10:04:15 UTC
Last modified: 1 Mar 2009, 10:24:10 UTC

My first iceworld. I am going to abort it unless you give me different instructions.


1. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=8357089

2. 153306 of 259248, phase 4/4 (14/10/2059)

3. Hours Elapsed: 0353:52:50 (1.37 s/TS)

4. The Blue Planet. Totally.

5. i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 4] HT: On

6. Yes, overclocked to 3.5 GHz. However, the PC was 24h Prime95-stable (8 instances) at 3.8 GHz, so I consider 3.5 GHz a safe "production speed".
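The figures in a report like this also give a rough idea of how much work is left, and how much a slowdown costs. A minimal Python sketch, assuming the s/TS rate stays constant for the rest of the run (a simplification; an iceworld is exactly the case where it does not). The function name `remaining_hours` is hypothetical and not part of BOINC or the CPDN client:

```python
def remaining_hours(current_ts: int, total_ts: int, sec_per_ts: float) -> float:
    """Estimate hours left in a run, assuming the s/TS rate stays constant."""
    if not 0 <= current_ts <= total_ts:
        raise ValueError("current_ts must lie between 0 and total_ts")
    if sec_per_ts <= 0:
        raise ValueError("sec_per_ts must be positive")
    # Timesteps still to run multiplied by seconds per timestep, converted to hours.
    return (total_ts - current_ts) * sec_per_ts / 3600.0


# Figures from the report above: timestep 153306 of 259248.
at_reported_rate = remaining_hours(153306, 259248, 1.37)
at_slowed_rate = remaining_hours(153306, 259248, 15.0)
print(f"{at_reported_rate:.1f} h at 1.37 s/TS, {at_slowed_rate:.1f} h at 15 s/TS")
```

At 1.37 s/TS the remaining 105,942 timesteps take roughly 40 hours; at 15 s/TS the same timesteps would take roughly 441 hours, or over 18 days.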
ID: 36260
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,405,498
RAC: 10,268
Message 36261 - Posted: 1 Mar 2009, 10:43:20 UTC - in response to Message 36260.  

My first iceworld. I am going to abort it unless you give me different instructions.
Your task has slowed to about 15 s/TS on the last trickle, from a regular 1.1 s/TS. Its partner task on the same WU, 8357085, has started processing at a faster s/TS from the same point in phase 4.

Looks like a candidate to abort.
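The check used here compares the latest s/TS against the model's usual rate. A minimal Python sketch of that comparison; the function name and the 5x threshold are illustrative assumptions, not a CPDN or BOINC rule. The median of earlier trickles is used as the baseline so that one odd value does not skew it:

```python
from statistics import median


def is_slowdown(history_s_per_ts: list[float], latest: float, factor: float = 5.0) -> bool:
    """Return True if the latest s/TS is at least `factor` times the typical earlier rate.

    The default factor of 5 is an arbitrary illustrative threshold.
    """
    if not history_s_per_ts:
        raise ValueError("need at least one earlier s/TS value")
    baseline = median(history_s_per_ts)
    return latest >= factor * baseline


# Rates quoted above: a regular ~1.1 s/TS, then ~15 s/TS on the last trickle.
print(is_slowdown([1.1, 1.1, 1.2, 1.1], 15.0))  # True
print(is_slowdown([1.1, 1.1, 1.2, 1.1], 1.3))   # False
```

With this threshold, a jump from 1.2 to 12.4 s/TS would also be flagged, while a change from 1.2 to 1.33 s/TS would not.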
ID: 36261
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36263 - Posted: 1 Mar 2009, 11:34:05 UTC
Last modified: 1 Mar 2009, 11:52:53 UTC

I see you've now aborted that iceworld, Mican. Its whole workunit #6278040 is rather interesting. A model on a machine with Linux crashed at about the same point. The model Hagar mentioned that seems to have speeded up was on a Mac. It crashed.

Conan has a model from the same workunit running on Linux and Verstapp has one on Intel/Windows, both further behind. Both Conan and Verstapp are contactable. I'll ask them to watch their models to see what happens well into Phase 4. It will be particularly interesting to see whether Conan gets an iceworld on Linux.
ID: 36263
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36264 - Posted: 1 Mar 2009, 12:31:21 UTC

David Raison asked

'One last thought - I also run the SETI cuda apps on this PC - could they interfere?'

I haven't seen any reports that using a video card for GPU crunching causes graphics problems in tasks running at the same time on the CPU. I expect BOINC CUDA tasks, like tasks running on the CPU, also run at low priority, reining back the moment the resource is needed for some other job.

But thanks for raising the question because it's a possibility we need to watch out for.
ID: 36264
Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 36311 - Posted: 5 Mar 2009, 21:32:59 UTC - in response to Message 36263.  

I see you've now aborted that iceworld, Mican. Its whole workunit #6278040 is rather interesting. A model on a machine with Linux crashed at about the same point. The model Hagar mentioned that seems to have speeded up was on a Mac. It crashed.

Conan has a model from the same workunit running on Linux and Verstapp has one on Intel/Windows, both further behind. Both Conan and Verstapp are contactable. I'll ask them to watch their models to see what happens well into Phase 4. It will be particularly interesting to see whether Conan gets an iceworld on Linux.


G'Day mo.v,
Just an update: I took a quick look at Verstapp's work unit, as it has reached phase 4 and I am still at phase 1. The last TS that he uploaded has just taken a large drop from a consistent 1.2 s/TS to 1.33 s/TS, so perhaps he has also gone into an iceworld?
ID: 36311
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36317 - Posted: 6 Mar 2009, 0:57:08 UTC

I bet you're right. I don't think he can have seen my PM about it on this forum, so I'll send him another on the independent one where he visits most days. He has a lot of computers and may not be able to look at all his models' graphics very often.
ID: 36317
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,405,498
RAC: 10,268
Message 36319 - Posted: 6 Mar 2009, 8:22:11 UTC

The task 8357088 has gone from consistent 1.2 s/TS to 12.4 s/TS on the last trickle at Phase 4 TS 151228.
ID: 36319
old_user555464

Joined: 1 Feb 09
Posts: 1
Credit: 105,577
RAC: 0
Message 36378 - Posted: 14 Mar 2009, 13:30:19 UTC

Hello,

I think I have an iceworld/blue earth problem. I've been running 2 Hadsm3mh models simultaneously; details are:

Model 1
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=8314591
Timestep 159890
2.30s/TS
Temp display is blue
Windows Vista Home Premium: Intel Core 2 Quad Q9400@2.66GHz
No overclocking

Model 2
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=8314609
Timestep 49681
2.83s/TS
Temp display is blue
Windows Vista Home Premium: Intel Core 2 Quad Q9400@2.66GHz
No overclocking

Any thoughts would be appreciated.


ID: 36378
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 36379 - Posted: 14 Mar 2009, 13:51:21 UTC - in response to Message 36378.  
Last modified: 14 Mar 2009, 13:53:00 UTC

Yep. It looks like both of them are "iceworlds". On one of yours, two other Intel/Windows PCs running models from that work unit also slowed down at the same point. On the other, two other Intel/Windows PCs running models from that work unit haven't returned a trickle for a while, after reaching the point where yours slowed down. Best to abort them. Bad luck to get two of them at the same time. Good luck with your next models.
ID: 36379
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,405,498
RAC: 10,268
Message 36380 - Posted: 14 Mar 2009, 14:14:30 UTC - in response to Message 36378.  

Hello, I think I have an iceworld/blue earth problem. ..... Any thoughts would be appreciated.

Your task 8314591 has slowed dramatically since 13th Feb 2009. It is from WU 6270500.
One task from that WU has completed: 8314609 on Intel/Linux.
Two tasks have slowed dramatically at the same point in phase 4 as yours: 8314585 and 8314586, both on Intel/Windows.

Your task 8314609 has slowed dramatically since 20th Feb 2009. It is from WU 6270502.
One task from that WU has completed: 8314603 on Intel/Linux.
One task from that WU is near completion: 8314605 on AMD/Vista.
One task has slowed dramatically at the same point in phase 4 as yours: 8314607 on Intel/Windows.

They both look like candidates to abort. Good luck with the new ones.
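The cross-check above (did other tasks from the same workunit slow down at the same point?) can be sketched in Python as follows. The `Task` record, the 2000-timestep tolerance and all of the example task IDs and timesteps are hypothetical, made up for illustration rather than taken from any real workunit:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    platform: str               # e.g. "Intel/Windows"
    slowdown_ts: Optional[int]  # timestep where s/TS jumped, or None if it never did


def matching_slowdowns(mine: Task, others: list[Task], window: int = 2000) -> list[Task]:
    """Return the other tasks in the workunit that slowed within `window` timesteps of mine."""
    if mine.slowdown_ts is None:
        return []
    return [
        t for t in others
        if t.slowdown_ts is not None and abs(t.slowdown_ts - mine.slowdown_ts) <= window
    ]


# Hypothetical workunit: IDs and timesteps are made up for illustration.
mine = Task(1, "Intel/Windows", 152000)
others = [
    Task(2, "Intel/Windows", 152100),
    Task(3, "Intel/Windows", 151900),
    Task(4, "Intel/Linux", None),  # finished without slowing down
]
matches = matching_slowdowns(mine, others)
print([t.task_id for t in matches])  # [2, 3]
```

Several matches on the same platform, with a task on another platform running past that point, is the pattern described in this thread as a sign of an iceworld rather than a local hardware fault.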

ID: 36380
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36381 - Posted: 14 Mar 2009, 14:38:48 UTC

Graham, you may like to replace one or both aborted iceworlds with the new type of model, HadAM3P. None of these turn into iceworlds as far as we know. You may need to select this model in the climateprediction section of your account. They're described in the forum news thread at the top of Number Crunching. Just don't let your computer run 4 of them at once, because too many slow each other down. They run very nicely alongside HadSM and HadSMMH models.
ID: 36381
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,405,498
RAC: 10,268
Message 36405 - Posted: 17 Mar 2009, 15:52:08 UTC

Hah, got my own iceworld today !

Task 8418410 on WU 6285424. Two other models are stuck at the same place; all are on Intel/Windows. One model is still running in phase 4 on Intel/Linux.

One to abort.

ID: 36405
Rick B

Joined: 17 Feb 09
Posts: 31
Credit: 1,505,895
RAC: 529
Message 36420 - Posted: 19 Mar 2009, 20:10:32 UTC

Looks like I have an iceworld. All blue on the graphics, and it occurred within the last 24 to 48 hrs.

A link to the model/ResultID webpage

- Task ID is number 7776737

A current timestep of that model (on the globe graphic)

- Time step is 08 08 1830

The s/TS value (on the globe graphic. Remember, you can hit the Z key while viewing the globe and it will give you this additional text/status information.)

- 1.67

Whether the temperature display of the globe graphic is blue.

- Temperature display is blue, in the -30 to -36 range

What your processor/CPU and Operating System is (i.e. Intel or AMD on Windows or Linux)

- Intel(R) Pentium(R) Dual CPU E2180 @ 2.00 GHz, 2.00 GB RAM on Windows Vista

Whether you are overclocking.

- No overclocking.

Should this model be continued or aborted?

ID: 36420
Rick B

Joined: 17 Feb 09
Posts: 31
Credit: 1,505,895
RAC: 529
Message 36422 - Posted: 19 Mar 2009, 20:17:02 UTC

My apologies about the time step: it is 80997 of 259248.
ID: 36422
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 36424 - Posted: 19 Mar 2009, 20:33:01 UTC

I keep a record of the time that trickles are generated, and if I find that a model has missed the scheduled time of a few trickles and the temp is blue, then I just abort it and get another one.
Since upgrading from a P4 to a quad core processor, it's just not worth the hassle of restoring a backup for 1 out of 4 models. There are millions of combinations to test. Backups now are only in case I have a power failure or some such. And even then I find that the computer and the models recover OK.
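The rule of thumb above (a blue globe plus several missed trickles) can be sketched in Python like this. The expected trickle interval is estimated as the mean gap between earlier trickles, which assumes the model ran at a steady rate until it stalled; the function names, the threshold of 3 for "a few" and the example times are all assumptions for illustration:

```python
from datetime import datetime, timedelta


def missed_trickles(trickle_times: list[datetime], now: datetime) -> int:
    """Count whole trickle intervals that have passed since the last trickle arrived."""
    if len(trickle_times) < 2:
        raise ValueError("need at least two trickle times to estimate the interval")
    times = sorted(trickle_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    interval = sum(gaps, timedelta()) / len(gaps)  # mean gap between trickles
    if interval <= timedelta(0):
        raise ValueError("trickle times must not all be identical")
    elapsed = now - times[-1]
    if elapsed <= timedelta(0):
        return 0
    return int(elapsed / interval)  # timedelta / timedelta gives a float


def suggest_abort(trickle_times: list[datetime], now: datetime,
                  globe_is_blue: bool, threshold: int = 3) -> bool:
    """Hypothetical rule: a blue globe plus at least `threshold` missed trickles."""
    return globe_is_blue and missed_trickles(trickle_times, now) >= threshold


# Hypothetical log: one trickle every 10 hours, then nothing for 35 hours.
start = datetime(2009, 3, 1)
log = [start + timedelta(hours=10 * i) for i in range(5)]
now = start + timedelta(hours=75)
print(missed_trickles(log, now), suggest_abort(log, now, globe_is_blue=True))  # 3 True
```

Requiring both conditions keeps a model that is merely paused (for example, after a reboot) from being flagged just because a trickle is late.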

With your model, 2 other crunchers appear to have passed where you were, so with a bit of luck one or more of them will run the full length.

I'd suggest aborting it.


Backups: Here
ID: 36424

©2024 cpdn.org