Thread 'Lost another one!'

Author	Message
Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69119 - Posted: 5 Jul 2023, 19:38:07 UTC Please please please fix the program so it doesn't get upset with a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident? ID: 69119

rob Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 69120 - Posted: 6 Jul 2023, 8:20:37 UTC - in response to Message 69119. Last modified: 6 Jul 2023, 8:21:18 UTC Did you take a note of the task number? Given the proportion of the current batch of tasks that are failing for everyone (due to the nature of the data), finding an "odd ball" error message would be far easier if its task number were part of your reporting message. ID: 69120

nairb Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 69124 - Posted: 6 Jul 2023, 9:09:15 UTC I find that a power cut is good for killing w/u as well. All other projects' w/u recover OK. ID: 69124

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 69127 - Posted: 6 Jul 2023, 11:16:14 UTC - in response to Message 69119. Last modified: 6 Jul 2023, 11:18:39 UTC "Please please please fix the program so it doesn't get upset with a restart. This is absurd! The computer did not crash. I turned it off with a perfectly normal Windows shutdown (I forgot that machine had a WAH task running). Turned it back on and I get a computation error. Fix your code. For goodness sake, surely there are checkpoints written? Surely you write the next one before deleting the old one? So even if I yank the plug out of the wall, it should continue from the last checkpoint. I couldn't write a program that badly if I tried. How has someone done it by accident?" 1/ The model creates restart files perfectly fine. 2/ If BOINC (not the model) forces the model to quit before it has properly closed its files, it will crash on a restart. In other words, it's not the model, it's the environment it's running in. The same thing can happen (and does) on Linux. --- CPDN Visiting Scientist ID: 69127

Glenn Carver Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 69128 - Posted: 6 Jul 2023, 11:24:06 UTC - in response to Message 69124. "I find that a power cut is good for killing w/u as well. All other projects' w/u recover OK." That's not true in my experience. I've had LHC tasks fail on restart. CPDN models are probably more susceptible, as I suspect their restart file sizes are bigger. ID: 69128

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69130 - Posted: 6 Jul 2023, 14:21:37 UTC - in response to Message 69127. Last modified: 6 Jul 2023, 14:34:25 UTC "1/ The model creates restart files perfectly fine. 2/ If BOINC (not the model) forces the model to quit before it has properly closed its files, it will crash on a restart. In other words, it's not the model, it's the environment it's running in. The same thing can happen (and does) on Linux." No, the model shouldn't go wrong just because the files don't get closed. It should always have a stable state on the disk to restart from. Think worst-case scenario: a power cut or a complete OS crash/lockup. This is very important for tasks running for 2 weeks to 2 months. When writing new data, keep the last one until the new one is complete. Likewise, I don't erase the USB disk I back up my computer to and then back up to it, because during the backup I would have no backup. I cycle two USB disks; you should cycle two data files. ID: 69130
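The "keep the last one until the new one is complete" idea from the post above can be sketched roughly as follows. This is only an illustration of the general technique and makes no claim about how the CPDN model or BOINC actually handle restart files. The function names, the `.prev` suffix, and the JSON format are all hypothetical choices made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` so that a complete checkpoint always exists on disk."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        # Keep the previous good checkpoint as a second copy.
        if os.path.exists(path):
            os.replace(path, path + ".prev")
        # Only now move the completed file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the newest readable checkpoint, or None if there is none."""
    for candidate in (path, path + ".prev"):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            continue  # missing or damaged: fall back to the older copy
    return None
```

With this layout, an interruption at any point should leave either the new file or the previous one readable, so `load_checkpoint("restart.json")` falls back to `restart.json.prev` when the newest file is missing or truncated. How well this survives a real power cut still depends on the operating system and the disk honouring the flush.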

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69131 - Posted: 6 Jul 2023, 14:23:40 UTC - in response to Message 69128. Last modified: 6 Jul 2023, 14:24:17 UTC "That's not true in my experience. I've had LHC tasks fail on restart. CPDN models are probably more susceptible, as I suspect their restart file sizes are bigger." LHC is terrible because of VirtualBox. VirtualBox lives in a world of its own and doesn't do what you tell it. If I have LHC tasks running and tell Windows to restart, Windows tells me "Virtualbox still has active connections". Nobody has been able to tell me what on earth that means, so I click "restart anyway". I do let the "Virtualbox is saving state" message go away first. But it's no big deal with LHC because the tasks aren't huge, so you don't lose as much work. CPDN needs to fix this. ID: 69131

Mr. P Hucker Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69132 - Posted: 6 Jul 2023, 14:32:32 UTC - in response to Message 69120. "Did you take a note of the task number? Given the proportion of the current batch of tasks that are failing for everyone (due to the nature of the data), finding an 'odd ball' error message would be far easier if its task number were part of your reporting message." https://www.cpdn.org/result.php?resultid=22326503 and https://www.cpdn.org/result.php?resultid=22321374 ID: 69132