climateprediction.net (CPDN) home page
Thread 'Lost another one!'

Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69119 - Posted: 5 Jul 2023, 19:38:07 UTC

Please, please, please fix the program so it doesn't get upset by a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness' sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident?
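
For readers unfamiliar with the "write the next one before deleting the old one" pattern, here is a minimal Python sketch. It is not the CPDN or BOINC code; the function name `write_checkpoint` and the `.ckpt-` prefix are made up for illustration, and it assumes the checkpoint fits in memory as a single `bytes` object.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Write data so that `path` always holds a complete checkpoint (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
        # Only now does the new file take the place of the old checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old checkpoint untouched and remove the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed before `os.replace` runs, the previous checkpoint is still in place and only a stray temporary file is left behind. How strong the guarantee is against a real power cut depends on the OS and filesystem, so treat this as a sketch of the idea rather than a proof of durability.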
ID: 69119
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 69120 - Posted: 6 Jul 2023, 8:20:37 UTC - in response to Message 69119.  
Last modified: 6 Jul 2023, 8:21:18 UTC

Did you take a note of the task number?
Given the proportion of the current batch of tasks that are failing for everyone (due to the nature of the data) finding an "odd ball" error message would be far easier if its task number were part of your reporting message.
ID: 69120
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 69124 - Posted: 6 Jul 2023, 9:09:15 UTC

I find that a power cut is good for killing w/u as well. All other projects' w/u recover OK.
ID: 69124
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69127 - Posted: 6 Jul 2023, 11:16:14 UTC - in response to Message 69119.  
Last modified: 6 Jul 2023, 11:18:39 UTC

Please, please, please fix the program so it doesn't get upset by a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness' sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident?

1/ The model creates restart files perfectly fine.
2/ If BOINC (not the model) forces the model to quit before it has properly closed its files, it will crash on a restart.

In other words, it's not the model, it's the environment it's running in. The same thing can happen (and does) on Linux.
---
CPDN Visiting Scientist
ID: 69127
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69128 - Posted: 6 Jul 2023, 11:24:06 UTC - in response to Message 69124.  

I find that a power cut is good for killing w/u as well. All other projects' w/u recover OK.
That's not true in my experience. I've had LHC tasks fail on restart. CPDN models are probably more susceptible, as I suspect their restart file sizes are bigger.
ID: 69128
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69130 - Posted: 6 Jul 2023, 14:21:37 UTC - in response to Message 69127.  
Last modified: 6 Jul 2023, 14:34:25 UTC

1/ The model creates restart files perfectly fine.
2/ If BOINC (not the model) forces the model to quit before it has properly closed its files, it will crash on a restart.

In other words, it's not the model, it's the environment it's running in. The same thing can happen (and does) on Linux.
No, the model shouldn't go wrong just because the files don't get closed. It should always have a stable state on disk to restart from. Think of the worst-case scenario: a power cut or a complete OS crash or lockup. That is very important for tasks that run for 2 weeks to 2 months. When writing new data, keep the previous checkpoint until the new one is complete. Likewise, I don't erase the USB disk I back up my computer to and then back up to it, because during the backup I would have no backup at all. I cycle two USB disks; you should cycle two data files.
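
To make the "cycle two data files" idea concrete, here is a rough Python sketch. It is an illustration only, not how the CPDN model writes its restart files: the slot names `restart_a.bin` and `restart_b.bin`, the 40-byte header layout, and the functions `save` and `load_latest` are all assumptions made up for this example.

```python
import hashlib
import os
import struct

SLOTS = ("restart_a.bin", "restart_b.bin")  # hypothetical file names
HEADER_LEN = 8 + 32  # 8-byte step counter + 32-byte SHA-256 digest


def save(directory: str, step: int, payload: bytes) -> str:
    """Write the checkpoint for `step` into the slot the previous step did not use."""
    path = os.path.join(directory, SLOTS[step % 2])
    step_bytes = struct.pack("<Q", step)
    digest = hashlib.sha256(step_bytes + payload).digest()
    with open(path, "wb") as f:
        f.write(step_bytes + digest + payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to disk
    return path


def _read(path: str):
    """Return (step, payload) if the file is complete and intact, else None."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except OSError:
        return None
    if len(blob) < HEADER_LEN:
        return None
    step_bytes, digest, payload = blob[:8], blob[8:HEADER_LEN], blob[HEADER_LEN:]
    if hashlib.sha256(step_bytes + payload).digest() != digest:
        return None  # truncated or corrupted write
    return struct.unpack("<Q", step_bytes)[0], payload


def load_latest(directory: str):
    """Return the newest valid (step, payload), or None if neither slot is usable."""
    found = [c for c in (_read(os.path.join(directory, s)) for s in SLOTS) if c]
    return max(found, key=lambda c: c[0]) if found else None
```

Because consecutive steps alternate between the two slots, an interrupted write can damage only the slot being written; `load_latest` then rejects it on the checksum and falls back to the other slot, which costs one checkpoint interval of work rather than the whole task.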
ID: 69130
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69131 - Posted: 6 Jul 2023, 14:23:40 UTC - in response to Message 69128.  
Last modified: 6 Jul 2023, 14:24:17 UTC

That's not true in my experience. I've had LHC tasks fail on restart. CPDN models are probably more susceptible, as I suspect their restart file sizes are bigger.
LHC is terrible because of VirtualBox. VirtualBox lives in a world of its own and doesn't do what you tell it. If I have LHC tasks running and tell Windows to restart, Windows tells me "Virtualbox still has active connections". Nobody has been able to tell me what on earth that means, so I click "restart anyway". I do let the "Virtualbox is saving state" message go away first. But it's no big deal with LHC because the tasks aren't huge, so you don't lose as much work. CPDN needs to fix this.
ID: 69131
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69132 - Posted: 6 Jul 2023, 14:32:32 UTC - in response to Message 69120.  

Did you take a note of the task number?
Given the proportion of the current batch of tasks that are failing for everyone (due to the nature of the data) finding an "odd ball" error message would be far easier if its task number were part of your reporting message.

https://www.cpdn.org/result.php?resultid=22326503 and https://www.cpdn.org/result.php?resultid=22321374
ID: 69132


©2024 cpdn.org