Thread 'Compute Errors / Bad Work Units?'

Author	Message
ritterm Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0	Message 46831 - Posted: 21 Aug 2013, 20:55:37 UTC It looks like there are some recently generated WUs out there that are bad: 8559375 and 8559214. Most hosts are crashing shortly after the start with similar stderr output: <core_client_version>7.0.64</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Model crashed: INITTIME: Atmosphere basis time mismatch 2048 . . . Sorry, too many model crashes! :-( Called boinc_finish </stderr_txt> ]]> ID: 46831 · Reply Quote
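For anyone comparing several failed tasks, a minimal Python sketch of pulling the exit code and the "Model crashed" reason out of a saved stderr dump. The function name `summarize_stderr` is made up for illustration, and the sample text is the output quoted above with line breaks assumed where the forum flattened them.

```python
import re

def summarize_stderr(text):
    """Return (exit_code, crash_messages) found in a task's stderr dump."""
    # The client reports the exit status as e.g. "exit code 22 (0x16)".
    match = re.search(r"exit code (\d+) \(0x[0-9a-fA-F]+\)", text)
    exit_code = int(match.group(1)) if match else None
    # Each model failure is logged on its own line as "Model crashed: <reason>".
    crashes = [c.strip() for c in re.findall(r"Model crashed:\s*(.+)", text)]
    return exit_code, crashes

# Sample laid out one item per line (an assumption about the original layout).
sample = """<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>"""

print(summarize_stderr(sample))
```

Running it on a handful of wingmen's results makes it easy to see whether they all died with the same reason.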

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46835 - Posted: 21 Aug 2013, 21:33:35 UTC - in response to Message 46831. Oh great. OK, another email. :( Thanks for the heads up. ID: 46835 · Reply Quote

MartinNZ Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 46902 - Posted: 28 Aug 2013, 4:12:39 UTC - in response to Message 46835. Same with me on: WU 8559724 Task 15941889 It ran for a total of 22 secs, and other people's tasks also failed: three immediately, one after 30 mins. Regards, Martin ID: 46902 · Reply Quote

KWSN - Sir Frank of the Wood Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 46930 - Posted: 1 Sep 2013, 13:13:58 UTC ...my machine made it to 96% on work unit 8537815 before keeling over with an exit code of 22...i think it was perhaps OS confusion caused by other things that were running, but don't know for sure...other machines gave up much sooner... while looking at other folks' progress on this work unit, i noticed that machine 1105670 has crashed on nearly every work unit it has attempted...any obvious reason for this situation ??? frank ID: 46930 · Reply Quote

Bellator Send message Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0	Message 46931 - Posted: 1 Sep 2013, 15:07:58 UTC - in response to Message 46930. For what it's worth: 8549908 has been running for well over 400 hours and I have at least 20 trickles. Just lucky, I guess... ID: 46931 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524	Message 46933 - Posted: 1 Sep 2013, 16:17:11 UTC My best guess is that machine 1105670 needs to reset the project. ID: 46933 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 46934 - Posted: 1 Sep 2013, 18:54:12 UTC Last modified: 1 Sep 2013, 18:55:19 UTC Old Athlon XP, not compatible with HadCM3N tasks. Thanks for pointing it out! I'll notify Andy so he can cut the machine's water off. [Edited for typo.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 46934 · Reply Quote

ritterm Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0	Message 46935 - Posted: 1 Sep 2013, 19:15:23 UTC These might be "typical" model crashes, but the stderr output is different than I've seen in my limited CPDN experience. Just pointing them out in case they are new/relevant. Task 15942401 -- Model crashed: ATM_DYN : INVALID THETA DETECTED. Task 15930887 -- Model crashed: INITTIME: Atmosphere basis time mismatch Task 15899492 -- terminate called after throwing an instance of 'St9bad_alloc' ID: 46935 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46936 - Posted: 1 Sep 2013, 20:21:46 UTC - in response to Message 46935. INVALID THETA: A standard message. Means that the physical environment became impossible. (The gravity suddenly disappeared, and all of the air floated into space.) (The temperature suddenly rose to a million degrees, and everything was cooked to a crisp.) (etc.) This is one of the things that the researchers are looking for: "For how long will this model run, given the starting values used?" Atmosphere basis time mismatch: Another standard message. Oops. The research assistant specified the wrong number of values somewhere. Alloc and Malloc: Memory allocation errors. This has shown up a few times, but I don't think that the problem was identified. ID: 46936 · Reply Quote
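The three explanations above can be turned into a small lookup. This is only a sketch: the substrings come from the messages quoted in this thread, the meanings paraphrase Les's post, and `CATEGORIES` and `classify` are names made up here, not part of any CPDN or BOINC tool.

```python
# (substring to look for, what it means according to the explanation above)
CATEGORIES = [
    ("INVALID THETA", "model physics became impossible; a result the researchers want"),
    ("basis time mismatch", "wrong number of values specified when the work was set up"),
    ("bad_alloc", "memory allocation error; cause not identified"),
]

def classify(line):
    """Map one stderr line to a rough meaning, or 'unrecognized'."""
    lowered = line.lower()
    for needle, meaning in CATEGORIES:
        if needle.lower() in lowered:
            return meaning
    return "unrecognized"

for msg in [
    "Model crashed: ATM_DYN : INVALID THETA DETECTED.",
    "Model crashed: INITTIME: Atmosphere basis time mismatch",
    "terminate called after throwing an instance of 'St9bad_alloc'",
]:
    print(msg, "->", classify(msg))
```

Only the first category says anything about the science; the other two point at how the work unit was built or at the host.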

ritterm Send message Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0	Message 46937 - Posted: 2 Sep 2013, 3:10:06 UTC Thanks for the information, Les. Clearly my memory (and search skills) are lacking as I didn't realize the "time mismatch" error was the one I noted at the beginning of the thread! :D ID: 46937 · Reply Quote

KWSN - Sir Frank of the Wood Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 46977 - Posted: 6 Sep 2013, 8:32:32 UTC astroWX: machine 1184413 seems to also be chewing up work units without any useful results... frank ID: 46977 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,008,987 RAC: 21,524	Message 46978 - Posted: 6 Sep 2013, 13:22:58 UTC As do 982003 and 1186330, two of my wingmen on one of my two tasks, and this one, 1234665, hasn't completed anything in the past year. I know this computer's record is not exemplary but it seems there are a lot of boxes out there that never seem to complete tasks....... ID: 46978 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 46981 - Posted: 6 Sep 2013, 18:50:12 UTC Last modified: 6 Sep 2013, 19:13:55 UTC Thanks for those, Frank and Dave. Message sent to Andy. Given that it is ~1945 BST, it is unlikely that anything will be done until Monday. [EDIT: What Andy does with your information: Sets max. tasks per day to -1 for the machines, which prevents more work being sent. An email is then sent to the owner stating what was done and requesting that they fix the problem, then post on these boards to tell us the fix was made, or to ask questions. When fixed, Andy resets the machine's account to allow more work. Unfortunately, not many of the people post, as compared to the number of emails sent.] [EDIT2: Neglected to mention: '1234665' had two successes of nine tasks. 'Anonymous' seems to have given-up in February, so nothing done on that one.] Copy: Hi, Andy, Three more brought to our attention by participants: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1186330 * http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=982003 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1184413 * * Mix of troubles; completes HadAM3P tasks Some success in the past but all now have SIGSEGV errors. * Darwin issue "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 46981 · Reply Quote
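Spotting such machines by eye is tedious, so here is a rough Python sketch of the kind of check a participant could run on counts copied from a host's results page. Everything here is an assumption for illustration: the `unreliable_hosts` name, the thresholds, and all sample rows except '1234665', whose two-of-nine figure is taken from the post above. It is not how the project staff actually select machines.

```python
def unreliable_hosts(results, min_tasks=5, max_success_rate=0.25):
    """Return host IDs with enough attempts and a low share of successes.

    results maps host ID -> (successful_tasks, total_tasks).
    Both thresholds are arbitrary and only illustrate the idea.
    """
    flagged = []
    for host, (ok, total) in results.items():
        # Skip hosts with too few attempts to judge fairly.
        if total >= min_tasks and ok / total <= max_success_rate:
            flagged.append(host)
    return sorted(flagged)

results = {
    "1234665": (2, 9),     # from the thread: two successes of nine tasks
    "example-a": (0, 12),  # made-up row: never completes anything
    "example-b": (7, 8),   # made-up row: healthy host
    "example-c": (0, 2),   # made-up row: too few tasks to judge
}

print(unreliable_hosts(results))
```

A host that passes the check can still be worth a second look, since a few HadAM3P successes may hide failures on every other model type.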

KWSN - Sir Frank of the Wood Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 47180 - Posted: 27 Sep 2013, 1:26:39 UTC astroWX: machine 1270234 seems to be spinning its wheels also... frank ID: 47180 · Reply Quote

astroWX Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 47181 - Posted: 27 Sep 2013, 5:36:32 UTC - in response to Message 47180. Good catch, Frank. Thanks. It's reported to Andy at the head-shed. There is now a thread for folks willing to dig-out and report ne'er-do-well machines (so we can keep them together). http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7674 "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 47181 · Reply Quote

old_user101069 Send message Joined: 4 Oct 05 Posts: 12 Credit: 610,967 RAC: 0	Message 47209 - Posted: 30 Sep 2013, 17:33:15 UTC Last modified: 30 Sep 2013, 17:35:40 UTC My computer, 1281635, has not completed a cpdn workunit in recent memory. Failure is wildly different per unit, and I have not been able to identify a pattern. I've always assumed that the work completed was still valuable, so I let it keep chugging away. Way back when I was a student, I used to back up my tasks, and restore them after an error, but I just don't have time to manage that anymore. If someone wants to take a look and tell me if I should just abandon the project, I'd be interested to hear it. (This is something that only affects this project. My pc is otherwise stable and goes for weeks without reboot.) Thanks! (edit: heh, I guess the second-most-recent task I got completed successfully. I just assumed it didn't since it was so short. nevertheless, it's still true that the vast majority of my tasks fail.) ID: 47209 · Reply Quote

Ananas Volunteer moderator Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 47212 - Posted: 30 Sep 2013, 18:26:06 UTC - in response to Message 47209. Last modified: 30 Sep 2013, 18:30:06 UTC My computer, 1281635, has not completed a cpdn workunit in recent memory. ... Refreshing your memory ... the project has been happy with this one :-) p.s.: CPDN results are sometimes hard stuff, so a (basically) reliable host that does not trash workunit after workunit within really short time should always have a chance to complete some results. If I were you, I would keep trying (I trashed way more than a handful too btw.). ID: 47212 · Reply Quote

WB8ILI Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164	Message 47213 - Posted: 30 Sep 2013, 18:29:25 UTC uioped1 - Until someone smarter than me tells you different - The work you are doing and sending (each trickle) is valuable even if the model doesn't run all the way to completion. One of your results I looked at had a "bad theta" which is not your problem. I do notice that your other results get suspended a lot. There might be some parameter you have set in BOINC that is causing this - for example, if you have CPU usage set to less than 100%. These models don't like to be interrupted even though technically that should be OK. Make sure you have "leave applications in memory when suspended" OFF. If you absolutely have to shut BOINC down or turn off your computer, go to the Projects tab and suspend "ClimatePrediction". This will save everything to disk. Then shut BOINC down. Then turn your computer off. ID: 47213 · Reply Quote
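The suspend-then-shut-down order can also be scripted with BOINC's command-line tool instead of the Projects tab. A minimal sketch, assuming `boinccmd` is on the PATH and that the project is attached under the URL shown (check the URL your own client lists); the function names and the 30-second pause are illustrative choices, not anything the project prescribes.

```python
import subprocess
import time

def shutdown_commands(project_url, boinccmd="boinccmd"):
    """Commands for the two steps: suspend the project, then stop the client."""
    return [
        [boinccmd, "--project", project_url, "suspend"],
        [boinccmd, "--quit"],
    ]

def shut_down_boinc(project_url, pause_seconds=30, dry_run=True):
    """Run the steps in order; with dry_run=True only return what would run."""
    commands = shutdown_commands(project_url)
    if not dry_run:
        for i, cmd in enumerate(commands):
            subprocess.run(cmd, check=True)
            if i == 0:
                # Give the models time to write their state before quitting.
                time.sleep(pause_seconds)
    return commands

PROJECT_URL = "http://climateprediction.net/"  # assumption: URL as attached
for cmd in shut_down_boinc(PROJECT_URL):  # dry run: prints, executes nothing
    print(" ".join(cmd))
```

Call `shut_down_boinc(PROJECT_URL, dry_run=False)` to actually run the steps, then turn the computer off once it returns.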

Ananas Volunteer moderator Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 47214 - Posted: 30 Sep 2013, 18:33:22 UTC - in response to Message 47213. ... Make sure you have "leave applications in memory when suspended" OFF. ... Not in all cases - a box with 10GB physical RAM that runs 24/7 can afford to leave the stuff in memory and a restart from checkpoint is always more risky than just restarting a suspended task. ID: 47214 · Reply Quote

MikeMarsUK Volunteer moderator Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 47215 - Posted: 30 Sep 2013, 18:51:07 UTC I personally would use the following settings: * Suspend work if CPU usage above % 0 (i.e., do not suspend) * Leave tasks in memory while suspended? Yes * Suspend work while computer is in use? No As WB8ILI says, when you are shutting down your PC, first suspend boinc, wait a few moments, then shut down Boinc. This gives the models a chance to shut down cleanly rather than being killed by Windows during the shutdown process. Similarly, if you are about to do something intensive on the PC (for example, gaming), then it is a good idea to shut down boinc then also. I'm a volunteer and my views are my own. News and Announcements and FAQ ID: 47215 · Reply Quote
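The three settings listed above can also be written to a local override file rather than set through the preferences dialog. A hedged Python sketch that builds the XML: the tag names are the ones used in BOINC's `global_prefs_override.xml` as far as I recall them, so check them against the BOINC documentation before relying on this, and note that `build_override` and `MIKE_SETTINGS` are names made up here.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of the three settings to BOINC preference tags.
MIKE_SETTINGS = {
    "suspend_cpu_usage": 0,     # 0 = never suspend because of other CPU load
    "leave_apps_in_memory": 1,  # leave tasks in memory while suspended: yes
    "run_if_user_active": 1,    # suspend while computer is in use: no
}

def build_override(prefs):
    """Return the contents of an override file as an XML string."""
    root = ET.Element("global_preferences")
    for tag, value in prefs.items():
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(build_override(MIKE_SETTINGS))
```

The string would be saved as `global_prefs_override.xml` in the BOINC data directory, after which the client has to be told to re-read its local preferences.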