Message boards : Number crunching : Compute Errors / Bad Work Units?
Author | Message |
---|---|
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
It looks like there are some recently generated WUs out there that are bad: 8559375 8559214 Most hosts are crashing shortly after the start with similar stderr output: <core_client_version>7.0.64</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: INITTIME: Atmosphere basis time mismatch 2048 . . . Sorry, too many model crashes! :-( Called boinc_finish </stderr_txt> ]]> |
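For anyone comparing many failed tasks, a minimal Python sketch like the one below can pull the exit code and the crash reason out of stderr text of the kind quoted above. The function name `summarize_stderr` is hypothetical, and the line breaks in the sample are an assumption, since the output above is shown flattened onto one line.

```python
import re


def summarize_stderr(text):
    """Return the exit code and the first 'Model crashed' reason found in text."""
    # The client reports the exit code as e.g. "exit code 22 (0x16)".
    exit_match = re.search(r"exit code (\d+)", text)
    # Assumes each "Model crashed:" message sits on its own line.
    crash_match = re.search(r"^Model crashed:\s*(.*)$", text, re.MULTILINE)
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        "crash_reason": crash_match.group(1).strip() if crash_match else None,
    }


# Sample rebuilt from the post above; the line layout is assumed.
sample = """<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>"""

print(summarize_stderr(sample))
```

Both fields come back as `None` when the text contains no matching line, so the same helper can be run over tasks that failed in other ways.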
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Oh great. OK, another email. :( Thanks for the heads up. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Same with me on: WU 8559724 Task 15941889 It ran a total of 22 secs, and other people's tasks also failed: 3 immediately, one after 30 mins. Regards, Martin |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
...my machine made it to 96% on work unit 8537815 before keeling over with an exit code of 22...i think it was perhaps OS confusion caused by other things that were running, but don't know for sure...other machines gave up much sooner... while looking at other folks' progress on this work unit, i noticed that machine 1105670 has crashed on nearly every work unit it has attempted...any obvious reason for this situation ??? frank |
Send message Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0 |
For what it's worth: 8549908 has been running for well over 400 hours and I have at least 20 trickles. Just lucky, I guess... |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
My best guess is that machine 1105670 needs to reset the project. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Old Athlon XP, not compatible with HadCM3N tasks. Thanks for pointing it out! I'll notify Andy so he can cut the machine's water off. [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
These might be "typical" model crashes, but the stderr output is different than I've seen in my limited CPDN experience. Just pointing them out in case they are new/relevant. Task 15942401 -- Model crashed: ATM_DYN : INVALID THETA DETECTED. Task 15930887 -- Model crashed: INITTIME: Atmosphere basis time mismatch Task 15899492 -- terminate called after throwing an instance of 'St9bad_alloc' |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
INVALID THETA: A standard message. Means that the physical environment became impossible. (The gravity suddenly disappeared, and all of the air floated into space.) (The temperature suddenly rose to a million degrees, and everything was cooked to a crisp.) (etc.) This is one of the things that the researchers are looking for: "For how long will this model run, given the starting values used?" Atmosphere basis time mismatch: Another standard message. Oops. The research assistant specified the wrong number of values somewhere. Alloc and Malloc: Memory allocation errors. These have shown up a few times, but I don't think that the problem was identified. |
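The three explanations above can be sketched as a small lookup in Python. This is only an illustration: the table `KNOWN_CAUSES` and the function `explain` are hypothetical names, the marker strings are taken from the task messages quoted earlier in this thread, and the wording of each explanation is a paraphrase of the post above.

```python
# Marker substrings from the stderr lines quoted in this thread, paired with
# a paraphrase of the explanation given for each.
KNOWN_CAUSES = [
    ("INVALID THETA",
     "model physics became impossible; a standard message"),
    ("Atmosphere basis time mismatch",
     "wrong number of values specified when the work unit was set up"),
    ("bad_alloc",
     "memory allocation error; cause not identified"),
]


def explain(stderr_line):
    """Return the explanation for the first known marker found in the line."""
    for marker, explanation in KNOWN_CAUSES:
        if marker in stderr_line:
            return explanation
    return "not covered in this thread"


for line in (
    "Model crashed: ATM_DYN : INVALID THETA DETECTED.",
    "Model crashed: INITTIME: Atmosphere basis time mismatch",
    "terminate called after throwing an instance of 'St9bad_alloc'",
):
    print(line, "->", explain(line))
```

A plain substring match is used because the markers appear verbatim in the output; anything unmatched is reported as not covered in the thread.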
Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0 |
Thanks for the information, Les. Clearly my memory (and search skills) are lacking as I didn't realize the "time mismatch" error was the one I noted at the beginning of the thread! :D |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
astroWX: machine 1184413 seems to also be chewing up work units without any useful results... frank |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
As do 982003 and 1186330, two of my wingmen on one of my two tasks, and this one, 1234665, hasn't completed anything in the past year. I know this computer's record is not exemplary, but it seems there are a lot of boxes out there that never seem to complete tasks... |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Thanks for those, Frank and Dave. Message sent to Andy. Given that it is ~1945 BST, it is unlikely that anything will be done until Monday. [EDIT: What Andy does with your information: Sets max. tasks per day to -1 for the machines, which prevents more work being sent. An email is then sent to the owner stating what was done and requesting that they fix the problem, then post on these boards to tell us the fix was made, or to ask questions. When fixed, Andy resets the machine's account to allow more work. Unfortunately, not many of the people post, as compared to the number of emails sent.] [EDIT2: Neglected to mention: '1234665' had two successes of nine tasks. 'Anonymous' seems to have given-up in February, so nothing done on that one.] Copy: Hi, Andy, "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
astroWX: machine 1270234 seems to be spinning its wheels also... frank |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Good catch, Frank. Thanks. It's reported to Andy at the head-shed. There is now a thread for folks willing to dig-out and report ne'er-do-well machines (so we can keep them together). http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7674 "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 4 Oct 05 Posts: 12 Credit: 610,967 RAC: 0 |
My computer, 1281635, has not completed a cpdn workunit in recent memory. Failure is wildly different per unit, and I have not been able to identify a pattern. I've always assumed that the work completed was still valuable, so I let it keep chugging away. Way back when I was a student, I used to back up my tasks, and restore them after an error, but I just don't have time to manage that anymore. If someone wants to take a look and tell me if I should just abandon the project, I'd be interested to hear it. (This is something that only affects this project. My pc is otherwise stable and goes for weeks without reboot.) Thanks! (edit: heh, I guess the second-most-recent task I got completed successfully. I just assumed it didn't since it was so short. nevertheless, it's still true that the vast majority of my tasks fail.) |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
My computer, 1281635, has not completed a cpdn workunit in recent memory. ... Refreshing your memory ... the project has been happy with this one :-) p.s.: CPDN results are sometimes hard stuff, so a (basically) reliable host that does not trash workunit after workunit within really short time should always have a chance to complete some results. If I were you, I would keep trying (I trashed way more than a handful too btw.). |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
uioped1 - Until someone smarter than me tells you different - The work you are doing and sending (each trickle) is valuable even if the model doesn't run all the way to completion. One of your results I looked at had a "bad theta" which is not your problem. I do notice that your other results get suspended a lot. There might be some parameter you have set in BOINC that is causing this - for example, if you have CPU usage set to less than 100%. These models don't like to be interrupted even though technically that should be OK. Make sure you have "leave applications in memory when suspended" OFF. If you absolutely have to shut BOINC down or turn off your computer, go to the Projects tab and suspend "ClimatePrediction". This will save everything to disk. Then shut BOINC down. Then turn your computer off. |
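The same suspend-then-shut-down order can be scripted instead of clicked through in the Projects tab. The Python sketch below is a rough illustration using the `boinccmd` command-line tool that ships with BOINC. The function names, the `pause_seconds` default, and the placeholder URL are all assumptions; the URL must be replaced with the one your client is actually attached to, and `boinccmd` may need to be run from the BOINC data directory so it can authenticate.

```python
import subprocess
import time


def shutdown_commands(project_url):
    """Build the two boinccmd calls: suspend the project, then quit the client."""
    return [
        ["boinccmd", "--project", project_url, "suspend"],
        ["boinccmd", "--quit"],
    ]


def shut_down_boinc(project_url, pause_seconds=10.0, dry_run=True):
    """Suspend the project, pause, then stop the BOINC client."""
    suspend, quit_client = shutdown_commands(project_url)
    if dry_run:
        # Only show what would be run; nothing is executed.
        for cmd in (suspend, quit_client):
            print(" ".join(cmd))
        return
    subprocess.run(suspend, check=True)
    # Pause length is a guess; the aim is to let the models finish writing to disk.
    time.sleep(pause_seconds)
    subprocess.run(quit_client, check=True)


# Placeholder URL (assumption): use the project URL shown in your own client.
shut_down_boinc("http://example.invalid/project/")
```

With `dry_run=True` (the default here) the script only prints the commands, which makes it safe to try before using it for real.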
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
... Not in all cases - a box with 10GB physical RAM that runs 24/7 can afford to leave the stuff in memory, and a restart from checkpoint is always more risky than just restarting a suspended task. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I personally would use the following settings: * Suspend work if CPU usage above 0% (i.e., do not suspend) * Leave tasks in memory while suspended? Yes * Suspend work while computer is in use? No As WB8ILI says, when you are shutting down your PC, first suspend BOINC, wait a few moments, then shut down BOINC. This gives the models a chance to shut down cleanly rather than being killed by Windows during the shutdown process. Similarly, if you are about to do something intensive on the PC (for example, gaming), then it is a good idea to shut down BOINC then also. I'm a volunteer and my views are my own. News and Announcements and FAQ |
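For those who prefer a file over the preferences dialog, the three settings above can be written as a local override. The Python sketch below generates the XML. It assumes the element names `suspend_cpu_usage`, `leave_apps_in_memory`, and `run_if_user_active` used in the BOINC client's `global_prefs_override.xml`; check them against the documentation for your client version before relying on this. The function name `build_override` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_override():
    """Return override XML matching the three settings suggested above."""
    root = ET.Element("global_preferences")
    settings = {
        "suspend_cpu_usage": "0",     # 0 = never suspend because of CPU usage
        "leave_apps_in_memory": "1",  # keep suspended tasks in memory
        "run_if_user_active": "1",    # keep computing while the PC is in use
    }
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


print(build_override())
```

The output would be saved as `global_prefs_override.xml` in the BOINC data directory, after which the client needs to be told to re-read its local preferences or be restarted.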
©2024 cpdn.org