Message boards : Number crunching : Computation errors, various exit statuses
Author | Message |
---|---|
Joined: 16 Apr 12 Posts: 6 Credit: 19,102 RAC: 0 |
Hi all, My computer (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1213585) has yet to get anywhere near finishing a single task. The two most recent tasks are the only two that I have received credit for, so I was hopeful that I would see them finish, but alas, to no avail (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14491527 and http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14491526). These two tasks ended with an exit code of -226. Here is a sample from the error log around the time that they were reported "completed": Thu Apr 26 04:47:54 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file. Most of the others ended with an error code of 0. I should add that the task list on the account page shows 4 tasks in progress, but my computer (Core 2 Duo P8700) only supports 2 concurrent tasks. I haven't seen or heard from two of those tasks in a while, and I'd be just as happy to see them erased from my list. I'd like to fix the underlying cause of these errors so that I can actually see a task through to completion. Many thanks in advance for your advice. Philip |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have never used a Mac so these are only general pointers. 1. Always suspend tasks and exit BOINC before switching the computer off. 2. Make sure that the BOINC data directory is excluded from any virus-scanning software. 3. Suspend tasks before doing any other work that is CPU-intensive, e.g. video rendering. Probably also worth running memtest on your computer for a few hours just to exclude memory problems as a cause. This flagged up a problem for me on a now non-existent Linux box. Others with more experience of Macs and who know more about the error codes than I do may be able to offer more specific advice. Dave |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Also worth checking whether you have upgraded your BOINC version. Some upgrades crashed all models shortly after starting. I can't remember the reason, but detaching from and reattaching to CPDN resolves the problem. There are posts in the Mac section of this board that explain the bits I have forgotten about this. Dave |
Joined: 16 Apr 12 Posts: 6 Credit: 19,102 RAC: 0 |
Hey Dave, Thanks for the response. With regard to your suggestions: 1. I'm on a Mac OS X 10.6.8 laptop which I almost never turn off. I also try to avoid having it go to sleep as much as possible. However, it's also the computer I use at work, so at least twice per day it has to go to sleep for transportation. I also set BOINC not to run while on battery power, which is an occasional occurrence. I do not suspend tasks or exit BOINC before having the computer go to sleep; I thought this was a Windows-only problem. Do other Mac users suggest I do this? Is this an issue specific to CPDN or to all of BOINC? 2. I've already excluded the BOINC data directory from antivirus scans and Time Machine backups, so I can rule those out as causes. 3. I usually do suspend tasks before doing memory-intensive things, but this is a rare occurrence. 4. I ran a memory test a couple of weeks ago and it checked out fine. 5. I'm running the latest version of BOINC. Thanks again, Philip |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Hi Philip, I can't be 100% sure that the suspending applies to Macs. It does to Linux, and since sticking to this rule I have drastically cut down the number of failures here. The way to test, I guess, would be to see whether the failures mostly occur shortly after restarting the machine or shortly after waking it up. Dave |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
The recommendation to suspend tasks manually before shutting the machine down doesn't really depend on the operating system: climate models do a lot of writing to disk and aren't quite up to industrial standard when interrupted, so it's advisable to allow the tasks to wrap up to their own satisfaction rather than relying on the operating system's more rapid, coercive closedown. |
Joined: 16 Apr 12 Posts: 6 Credit: 19,102 RAC: 0 |
Fair enough, but in the event that I don't (or my gf doesn't) turn off BOINC before closing the laptop and a mistake occurs, shouldn't it go back to the last checkpoint? Also, I found that error -226 means that the task failed too many times before reaching its next checkpoint. But I also saw that the log message "Result '(result)' exited with zero status but no 'finished' file" can mean that the task exited because of a lack of heartbeat. In fact, in the stderr of those two tasks I listed, there appears to be a heartbeat error. If there is no heartbeat from BOINC, shouldn't the task then shut down and not restart until instructed to do so by the client, thus preventing -226 from happening? (Unless perhaps the client is sending messages to restart the task but then failing to send a heartbeat.) So were two forms of built-in backup/error-correction bypassed here? |
Joined: 16 Apr 12 Posts: 6 Credit: 19,102 RAC: 0 |
From the log, at the start of when the shit hit the fan:
Thu Apr 26 00:35:07 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:35:07 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:35:40 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:35:40 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:35:40 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:35:41 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:35:41 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:36:03 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 00:36:10 2012 | | Project communication failed: attempting access to reference site
Thu Apr 26 00:36:32 2012 | | BOINC can't access Internet - check network connection or proxy configuration.
Thu Apr 26 00:37:24 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:37:24 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:37:56 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:37:56 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:37:56 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4w_1983_1_007948687_0 using hadam3p_pnw version 609 in slot 1
Thu Apr 26 00:37:58 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:37:58 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:37:58 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:38:19 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 00:39:34 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:39:34 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:40:06 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:40:06 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:40:06 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:40:08 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:40:08 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.

"Scheduler request failed: Couldn't resolve host name" appears sandwiched between all of the "task exited" errors from that time until the task finally died. Alternative hypothesis: was the crash caused by a lack of internet connectivity? I live in an area with limited internet connectivity. I don't have dial-up, but the broadband connection itself comes and goes on a whim (developing-world connectivity).

Two questions, if we are to accept this alternative hypothesis: 1) Is it really possible that a bad internet connection will crash a task? 2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by only when I specify, I don't mean just at certain times of day: I mean only when I click "update." |
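One way to check the connectivity hypothesis against a longer stretch of the log is to count how many "exited with zero status" entries fall close in time to a "Couldn't resolve host name" entry. The following is a minimal Python sketch, assuming the messages have been saved with one entry per line in the `timestamp | project | message` layout shown in the excerpt, and that the day and month names are in English. The function names and the 60-second window are illustrative choices, not anything BOINC provides, and entries landing close together would not by itself show that one caused the other.

```python
from datetime import datetime


def parse_line(line):
    """Split one log line into (timestamp, project, message).

    Returns None for lines that do not match the assumed
    'timestamp | project | message' layout.
    """
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        when = datetime.strptime(parts[0], "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return when, parts[1], parts[2]


def exits_near_dns_failures(lines, window_seconds=60):
    """Return (total task exits, exits within the window of a DNS failure)."""
    events = [e for e in map(parse_line, lines) if e is not None]
    dns_times = [t for t, _, msg in events
                 if "Couldn't resolve host name" in msg]
    exit_times = [t for t, _, msg in events
                  if "exited with zero status" in msg]
    near = [t for t in exit_times
            if any(abs((t - d).total_seconds()) <= window_seconds
                   for d in dns_times)]
    return len(exit_times), len(near)
```

If most exits turn out to sit far from any name-resolution failure, the connectivity hypothesis looks weaker; if nearly all sit close to one, it is at least worth testing by running with network activity switched off.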
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
> 2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by only when I specify, I don't mean just at certain times of day: I mean only when I click "update."

Sure; I run like that all the time. In BOINC Manager, select Activity and then select NETWORK ACTIVITY SUSPENDED. That will do what it says on the tin. Then you can choose when to return it to NETWORK ACTIVITY ALWAYS AVAILABLE in order to permit any updates, trickles, uploads and downloads to complete, and then go back to SUSPENDED until the next time. |
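The same switch can also be flipped from a script rather than from the Manager menu. This is a minimal sketch, assuming the `boinccmd` command-line tool that ships with BOINC is on the PATH and is allowed to talk to the running client (for example, by running it from the BOINC data directory). The Python wrapper names are made up for illustration; the `never` mode is taken to correspond to the Manager's "Network activity suspended" and `always` to "Network activity always available".

```python
import subprocess

VALID_MODES = ("always", "auto", "never")


def network_mode_command(mode, duration_seconds=None, boinccmd="boinccmd"):
    """Build the boinccmd argument list for changing the network mode.

    duration_seconds is optional; when given it is passed as the
    duration argument of --set_network_mode.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    cmd = [boinccmd, "--set_network_mode", mode]
    if duration_seconds is not None:
        cmd.append(str(int(duration_seconds)))
    return cmd


def set_network_mode(mode, duration_seconds=None):
    """Run boinccmd; raises CalledProcessError if the command fails."""
    subprocess.run(network_mode_command(mode, duration_seconds), check=True)
```

For example, `set_network_mode("never")` would be the scripted equivalent of choosing SUSPENDED, and `set_network_mode("always", 1800)` would ask for a 30-minute window of network access, on the understanding that the duration argument makes the change temporary.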
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
> Fair enough, but in the event that I don't (or my gf doesn't) turn off BOINC before closing the laptop and a mistake occurs, shouldn't it go back to the last checkpoint?

The programs used here have a large number of files open, many of which need to be saved, or checkpointed, as a group. If the computer shuts down while the program is in the middle of saving these files, the checkpoint will be corrupted, as some files will be out of sync with the others. And so the program crashes. This has been posted many times over the years, as has the advice to make regular backups in case of power failure (which includes hibernating the computer). There's even a section on backups in the README posts section of our other board, and a link to it in my sig.

> "Result '(result)' exited with zero status but no 'finished' file" can mean that the task exited because of a lack of heartbeat.

BOINC is in 2 parts: 1) the "client", which is the part that does the work and is invisible to users; 2) the Manager/GUI, whose job is to look at what the client is doing and display this to the user. These 2 parts communicate with each other 10 times per second, and this is referred to as the "heartbeat". I also notice a lot of "Suspend request from BOINC" entries in the error messages. This is most likely because you have the default setting for "Suspend work if CPU usage is above", which will result in the program stopping and starting a lot. These programs are from the UK Met Office, where they run on supercomputers. They just aren't designed for this constant interruption.

> If there is no heartbeat from BOINC, shouldn't the project then shut down

As mentioned above, 'heartbeat' is a BOINC function. And BOINC doesn't stop the science applications when it's having problems. It's just a 'traffic router', deciding which project's applications should get run next on multi-project computers.

The reason that you haven't received credit for a lot of your models is that credit is based on trickle_up files received by the project's servers, and a lot of your models have crashed before they got anywhere near creating trickles. (Trickles contain small amounts of data for the project; larger chunks are contained in zip files.)

Backups: Here |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
> 2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by only when I specify, I don't mean just at certain times of day: I mean only when I click "update."

There are a number of situations where that kind of approach is advisable, and not only where the Internet connection is unreliable. For example, the only reliable method I have found of finishing HADCM3N models is to let them run without interruption for the whole 40 years: I make a backup before the model starts, run the model with network activity off, then turn network activity on again only when the model has finished, and upload with four zips safely generated. 100% zip rate so far with that protocol, which is a distinct improvement on my previous record. However, an intermittent Internet connection won't crash a running model.

For Mac users who upgrade BOINC it is advisable to reset the project before running a new model. One, at least, of Philip's models had an 'execl' error ...
execl(/Library/Application Support/BOINC Data/projects/climateprediction.net/hadam3p_pnw_um_6.09_i686-apple-darwin, 181970) failed!
... which is usually attributed to a permissions problem caused by the upgrade (or a restore from backup). There is a new version of the Mac application in the pipeline that fixes the permissions problem, possibly waiting for a new region to be added to PNW, EU and SAF before being released ... |
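The "make a backup before the model starts" step can be scripted. This is a minimal Python sketch, assuming BOINC has been exited first so that no model is in the middle of writing its files, and assuming the data directory is the one that appears in the execl message (`/Library/Application Support/BOINC Data`). The function name and the backup location are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Copy the whole BOINC data directory into a timestamped folder.

    Returns the path of the new copy. The copy keeps file modes and
    timestamps where the platform allows, but not file ownership.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{data_dir.name}-{stamp}"
    # symlinks=True copies links as links instead of following them.
    shutil.copytree(data_dir, dest, symlinks=True)
    return dest
```

A call might look like `backup_data_dir("/Library/Application Support/BOINC Data", "/Volumes/Backup/boinc")`, where the second path is only a placeholder. Because a plain copy does not preserve ownership, restoring from such a copy could plausibly run into the same kind of permissions problem described for restores, so it is worth checking that the model executable is still runnable afterwards.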
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
To connect only when you want it to, you can click on Activity and then suspend network activity. Then you can just resume network activity when you want it to talk to the project. |
Joined: 16 Apr 12 Posts: 6 Credit: 19,102 RAC: 0 |
Huzzah! My computer has successfully completed its first two work units. I've had no problems since limiting network access to just 30 minutes a day, during a time period that I know my connection is good. |
©2024 cpdn.org