Message boards : Number crunching : Lost another one!
Author | Message |
---|---|
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Please please please fix the program so it doesn't get upset with a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident? |
Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073 |
Did you take a note of the task number? Given the proportion of the current batch of tasks that are failing for everyone (due to the nature of the data), finding an "odd ball" error message would be far easier if the task number were part of your report. |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
I find that a power cut is good for killing w/u as well. All other projects w/u recover ok. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Please please please fix the program so it doesn't get upset with a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident?" 1/ The model creates restart files perfectly fine. 2/ If BOINC (not the model) forces the model to quit before it has properly closed its files, it will crash on a restart. In other words, it's not the model, it's the environment it's running in. The same thing can happen (and does) on Linux. --- CPDN Visiting Scientist |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I find that a power cut is good for killing w/u as well. All other projects w/u recover ok." That's not true in my experience. I've had LHC tasks fail on restart. CPDN models are probably more susceptible, as I suspect their restart files are bigger. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"1/ The model creates restart files perfectly fine." No, the model shouldn't go wrong just because the files don't get closed. It should always have a stable state on the disk to restart from. Think worst-case scenario: a power cut or a complete OS crash/lockup. That is very important for tasks running for 2 weeks to 2 months. When writing new data, keep the last file until the new one is complete. Likewise, I don't erase the USB disk I back up my computer to and then back up to it, because during the backup I would have no backup. I cycle two USB disks; you should cycle two data files. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"That's not true in my experience. I've had LHC tasks fail on restart. CPDN model's are probably more susceptible as I suspect their restart filesizes are bigger." LHC is terrible because of VirtualBox. VirtualBox lives in a world of its own and doesn't do what you tell it. If I have LHC tasks running and tell Windows to restart, Windows tells me "Virtualbox still has active connections". Nobody has been able to tell me what on earth that means, so I click "restart anyway". I do let the "Virtualbox is saving state" message go away first. But it's no big deal with LHC because the tasks aren't huge, so you don't lose as much work. CPDN needs to fix this. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Did you take a note of the task number?" https://www.cpdn.org/result.php?resultid=22326503 and https://www.cpdn.org/result.php?resultid=22321374 |
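
The "write the next checkpoint before deleting the old one" scheme argued for in this thread can be sketched roughly as below. This is only an illustration of the general technique, not the CPDN model's or BOINC's actual code: the function names, the JSON format and the `.bak` suffix are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write a new checkpoint without ever leaving the disk with no usable one.

    Hypothetical helper: the new data goes to a temporary file first, and the
    previous checkpoint is kept as path + ".bak" instead of being deleted.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        if os.path.exists(path):
            os.replace(path, path + ".bak")  # keep the old checkpoint
        os.replace(tmp_path, path)  # the complete new file takes its place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the newest readable checkpoint, or None if none can be read."""
    for candidate in (path, path + ".bak"):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (OSError, ValueError):
            continue  # missing or truncated file: try the older copy
    return None
```

With this layout, an interruption at any point leaves either the new checkpoint or the previous one intact, so a restart calls `load_checkpoint("restart.json")` and resumes from whichever copy is readable (the file name here is again just an example).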
©2024 cpdn.org