Questions and Answers : Unix/Linux : Exit status 193 compute error
Author | Message |
---|---|
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
Hello, I've been meaning to raise this issue for quite some time, but I've only now found the time for it. I'm running hadcm3n work units on Linux Mint, and every one of them crashes unexpectedly somewhere between 2% and 10%, giving exit status 193. Google hasn't been very helpful in diagnosing this. The only relevant thing I've been able to find about this exit code is in a list of Windows system error codes, where it means that the application is not a valid Win32 application (but I don't need it to be, I'm running on Linux!). stderr looks normal, I guess, apart from exiting with signal 3, which is SIGQUIT, but that doesn't tell me much either. Could you please look into this and tell me what's wrong? Is this standard behaviour for these WUs? 98% of all tasks from other projects finish without errors, but tasks from climateprediction always crash. I'd like them to finish cleanly. |
Joined: 15 May 09 Posts: 4540 Credit: 19,024,725 RAC: 20,592 |
Clicking on the + symbol next to stderr shows lots of "Suspended CPDN Monitor - Suspend request from BOINC..." messages. This makes me wonder about your BOINC settings. Under the Tools menu, make sure that computing while the computer is in use is ticked. The CPDN models don't like lots of starting and stopping. It may also be worth suspending computation before doing anything very processor intensive, as that can throw a spanner in the works. I seem to remember reading that the exit codes have different meanings under Linux. |
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
Thanks for replying. I've checked, and computing is allowed while the computer is in use. In case the other settings matter: computing is allowed while processor usage is less than 70%, and BOINC is set to use at most 60% of CPU time (an old 90 nm CPU heats up fast if BOINC is left at the default 100% CPU usage). I've set it that way because I don't run a lot of CPU-intensive apps, and I suspend computation before using Flash, Java sockets or anything else that is CPU intensive, so no worries there. I always thought that what we're seeing in stderr comes not from an unusually high number of starts and stops, but from the fact that CPDN tasks run for a very long time. The deadline is usually 3 months away, and each of those tasks ran for at least a week before it crashed, so I don't think the stops are that frequent; they are just spread over a long timespan. About that exit code: yes, as I said, it should have a different meaning under Linux, but I can't find it... |
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
I've just found that this problem has been discussed here before: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7306#43193 Dagorath defines the error there. It seems that either the task is writing to memory it isn't permitted to access, or there's a problem with the RAM or hard drive. It's been a while since I ran a memtest, but I have no reason to suspect that either of them is faulty right now. It's important to note that most people there were talking about fail-to-success ratios, whereas none of my CPDN tasks have gotten past 10%. I've just upgraded both the BOINC version and the distro. If the problems persist, I'll try Greg's suggestions. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
There are just way too many suspend requests for only 1 to 3 trickles being returned. It could be that given the speed of the computer, the CPU usage setting, and the number of projects it is attached to, there are just too many opportunities for something bad to happen at key times. |
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
How does the CPU usage setting affect it? Sure, it's 60%, so it must pause the calculations sometimes to maintain that average usage, but the pause is likely done through the usual pre-emption: it probably saves the context, then restores it at a later time. As long as it has enough memory (and it has), that shouldn't be a problem. The OS does that all the time when a program's time on the CPU expires and another program needs to run. Besides, some of those failures happened when the system was idle and BOINC had 100% of CPU time to itself, so that doesn't hold, imho. Given that other Linux (and some Windows) users have experienced the same error, has any developer looked into the code so we can definitely rule out a bug from inside? As I said before, 98% of all other tasks from all other projects don't segfault. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If the pausing happens while the model is checkpointing, then part of the saved values will be current, and some will be from the previous checkpoint. So the checkpoint will be corrupted. "has any developer looked into the code": The code for the climate models belongs to the UK Met Office. It's been developed over many years by many climatologists and software engineers, to run on their supercomputers. It was never intended to be frequently stopped and started while running. It's also said to be close to a million lines of source code. And the project people don't have the source code for the main program, just the auxiliary programs. The people whose work is being run here are external to Oxford. They work in climate centres in various places around the world. |
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
Regarding the tasks not being designed to be frequently started and stopped: establishing which hardware and operating systems your application can be adapted to run on is probably part of the initial design. Frequent starting and stopping comes with the design of the BOINC system, and all projects are adapted to that. Since the 8th of August, I've only had 4 or 5 tasks unrelated to climate prediction crash. Moreover, I've set the upper limit of 70% CPU usage by non-BOINC software specifically so the tasks won't get suspended frequently. By the way, most of those pauses aren't reported by the GUI, because I don't see "CPU busy" 10 times per hour. I know this problem doesn't happen to a lot of users, but you can analyse whom it happens to. If you could filter the results that gave computation errors by cause, you might notice some commonalities, either in hardware or in settings, among the clients whose tasks have crashed; then you could give some general guidelines (have CPU usage at x%, and so on). I can't experiment enough with that myself, because hadcm3n estimates it needs 997 hours to complete, so I don't have enough data points for any statistics, but you do. |
Joined: 17 Aug 13 Posts: 6 Credit: 2,378 RAC: 0 |
And I've imagined that the best climatologists and software engineers have been working on that code for years, but a pointer that points where it shouldn't in a very very rare use case is enough to ruin otherwise brilliant work. I'm not saying that's definitely the case but it can happen. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"... then you could give some general guidelines ...": We have. As for a survey, it's been noted many times that those having problems are the ones NOT allowing BOINC to run without restriction. There are far more computers than are really needed for this project, so a few computers not being able to complete models doesn't matter. The models will just be re-issued, and sooner or later will be run by a computer that doesn't have a high crash record. The job of the two project people is to keep the servers running, and to produce results for the researchers. This is happening, with lots of computers to spare. |
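The opening post reports exit status 193 together with signal 3 and asks what that pair means on Linux. As a rough aid, here is a minimal Python sketch that turns a signal number into its name using the standard `signal` module. The constant `ASSUMED_BOINC_EXIT_SIGNAL` and the helper `describe` are hypothetical names made up for this illustration, and the idea that 193 is BOINC's generic "task ended by a signal" status is an assumption that should be checked against the BOINC source for the version in use.

```python
import signal

# Assumption: 193 is the status the BOINC client reports when a task was
# ended by a signal (EXIT_SIGNAL in BOINC's error_numbers.h). Check this
# against the BOINC version you actually run.
ASSUMED_BOINC_EXIT_SIGNAL = 193


def signal_name(signum: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


def describe(exit_status: int, signum: int) -> str:
    """Summarise an exit status and signal number pair (hypothetical helper)."""
    if exit_status == ASSUMED_BOINC_EXIT_SIGNAL:
        return f"exit status {exit_status}: task ended by {signal_name(signum)}"
    return f"exit status {exit_status}: not the assumed signal status"


if __name__ == "__main__":
    print(describe(193, 3))   # the pair reported in the first post
    print(signal_name(11))    # the signal usually behind a segfault
```

If the assumption holds, the exit status itself only says that a signal ended the task, and the signal number in stderr is the more informative part of the report. Signal numbers are platform specific, so the lookup should be run on the same kind of system that produced the log.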
©2024 cpdn.org