climateprediction.net (CPDN) home page
Thread 'Computation errors, various exit statuses'

Message boards : Number crunching : Computation errors, various exit statuses

old_user677346

Joined: 16 Apr 12
Posts: 6
Credit: 19,102
RAC: 0
Message 44089 - Posted: 26 Apr 2012, 8:35:12 UTC

Hi all,

My computer (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1213585) has yet to get anywhere near finishing a single task.

The two most recent tasks are the only two that I have received credit for, so I was hopeful that I would see them finish, but alas, to no avail (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14491527 and http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14491526). Both tasks ended with exit code -226.

Here is a sample from the error log around the time that they were reported "completed":

Thu Apr 26 04:47:54 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 04:47:54 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 04:47:54 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4w_1983_1_007948687_0 using hadam3p_pnw version 609 in slot 1
Thu Apr 26 04:47:55 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 04:47:55 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 04:47:55 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 04:48:14 2012 | | Resuming computation
Thu Apr 26 04:48:15 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 04:49:25 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 04:49:25 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 04:49:57 2012 | climateprediction.net | Computation for task hadam3p_pnw_cc4v_1982_1_007948686_0 finished
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_3.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_4.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_5.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_6.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_7.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_8.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_9.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_10.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_11.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_12.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:57 2012 | climateprediction.net | Output file hadam3p_pnw_cc4v_1982_1_007948686_0_13.zip for task hadam3p_pnw_cc4v_1982_1_007948686_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Computation for task hadam3p_pnw_cc4w_1983_1_007948687_0 finished
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_3.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_4.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_5.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_6.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_7.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_8.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_9.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_10.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_11.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_12.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:49:59 2012 | climateprediction.net | Output file hadam3p_pnw_cc4w_1983_1_007948687_0_13.zip for task hadam3p_pnw_cc4w_1983_1_007948687_0 absent
Thu Apr 26 04:50:13 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 04:52:09 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 04:52:09 2012 | climateprediction.net | Reporting 2 completed tasks, requesting new tasks for CPU and NVIDIA
Thu Apr 26 04:52:57 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
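
To see at a glance how often each task is being restarted, a log excerpt like the one above can be tallied with a few lines of Python. This is only a minimal sketch: the regular expression is an assumption based on the line format shown in this post, not an official BOINC log parser, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
# "Thu Apr 26 04:47:54 2012 | climateprediction.net | Task <name> exited with ..."
EXIT_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) \| "
    r"(?P<project>[^|]*) \| "
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_unfinished_exits(lines):
    """Return a Counter mapping task name -> number of 'no finished file' exits."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            counts[match.group("task")] += 1
    return counts


# Three lines copied from the log above; only two of them are exit events.
sample = [
    "Thu Apr 26 04:47:54 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file",
    "Thu Apr 26 04:47:54 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.",
    "Thu Apr 26 04:47:55 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file",
]

for task, n in count_unfinished_exits(sample).items():
    print(task, n)
```

Feeding the whole log file to `count_unfinished_exits` (for example via `open(path)`) would show whether one task or every task is affected.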




Most of the others ended with an exit code of 0.

I should add that the task list on the account page shows 4 tasks in progress, but my computer (Core 2 Duo P8700) only supports 2 concurrent tasks. I haven't seen or heard from two of those tasks in a while, and I'd be just as happy to see them removed from my list.

I'd like to fix the underlying cause of these errors so that I can actually see a task to completion. Many thanks in advance for your advice.

Philip
ID: 44089
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44090 - Posted: 26 Apr 2012, 9:11:42 UTC - in response to Message 44089.  

I have never used a Mac so these are only general pointers.
1. Always suspend tasks and exit BOINC before switching computer off.
2. Make sure that the BOINC data directory is excluded from any virus-scanning software.
3. Suspend tasks before doing any other work that is CPU-intensive, e.g. video rendering.
It is probably also worth running memtest on your computer for a few hours, just to exclude memory problems as a cause. This flagged up a problem for me on a now non-existent Linux box.
Others with more experience of Macs and who know more about the error codes than I do may be able to offer more specific advice.

Dave
ID: 44090
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44091 - Posted: 26 Apr 2012, 9:17:06 UTC - in response to Message 44090.  

It is also worth checking whether you have upgraded your BOINC version. Some upgrades crashed all models shortly after starting. I can't remember the reason, but detaching from and reattaching to CPDN resolves the problem. There are posts in the Mac section of this board that explain the bits I have forgotten about this.

Dave
ID: 44091
old_user677346

Joined: 16 Apr 12
Posts: 6
Credit: 19,102
RAC: 0
Message 44092 - Posted: 26 Apr 2012, 9:39:04 UTC

Hey Dave,

Thanks for the response. With regards to your suggestions:

1. I'm on a Mac OS X 10.6.8 laptop which I almost never turn off. I also try to avoid having it go to sleep as much as possible. However, it's also the computer I use at work, so at least twice per day it has to go to sleep for transportation. I have also set BOINC not to run while on battery power, which is an occasional occurrence.

I do not suspend tasks or exit BOINC before having the computer go to sleep; I thought this was a Windows-only problem. Do other Mac users suggest I do this? Is this an issue specific to CPDN or to all of BOINC?


2. I've already excluded the BOINC data directory from antivirus scans and Time Machine backups, so I can rule those out as causes.

3. I usually do suspend tasks before doing memory-intensive things, but this is a rare occurrence.

4. I ran a memory test a couple of weeks ago and it checked out fine.

5. I'm running the latest version of BOINC.

Thanks, again,
Philip
ID: 44092
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44093 - Posted: 26 Apr 2012, 9:51:05 UTC - in response to Message 44092.  

Hi Philip,

I can't be 100% sure that the advice about suspending applies to Macs. It does apply to Linux, and since sticking to this rule I have drastically cut down the number of failures here. The way to test, I guess, would be to see whether the failures mostly occur shortly after restarting the machine or shortly after waking it up.
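
One rough way to run that test is to line up failure times from the BOINC log against the times the machine woke up or restarted. The sketch below assumes you have already collected both lists by hand; the wake-up time used here is invented purely for illustration, and the ten-minute window is an arbitrary choice.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the BOINC log lines quoted in this thread.
FMT = "%a %b %d %H:%M:%S %Y"


def failures_soon_after(events, failures, window=timedelta(minutes=10)):
    """Return the failures that happen within `window` after any event."""
    return [
        failure
        for failure in failures
        if any(timedelta(0) <= failure - event <= window for event in events)
    ]


# Hypothetical wake-up time (NOT taken from the thread).
wakes = [datetime.strptime("Thu Apr 26 00:30:00 2012", FMT)]

# Two failure times copied from the logs posted in this thread.
failures = [
    datetime.strptime("Thu Apr 26 00:35:40 2012", FMT),
    datetime.strptime("Thu Apr 26 04:47:54 2012", FMT),
]

hits = failures_soon_after(wakes, failures)
print(f"{len(hits)} of {len(failures)} failures fell within 10 minutes of a wake-up")
```

If most failures land inside the window, the sleep/wake cycle is a likely suspect; if they are spread evenly, the cause probably lies elsewhere.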

Dave
ID: 44093
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 44095 - Posted: 26 Apr 2012, 13:43:48 UTC

The recommendation to suspend tasks manually before shutting the machine down doesn't really depend on the operating system: climate models do a lot of writing to disk and aren't quite up to industrial standard when interrupted, so it's advisable to allow the tasks to wrap up to their own satisfaction rather than relying on the operating system's more rapid, coercive closedown.
ID: 44095
old_user677346

Joined: 16 Apr 12
Posts: 6
Credit: 19,102
RAC: 0
Message 44097 - Posted: 26 Apr 2012, 18:46:42 UTC - in response to Message 44095.  

Fair enough, but in the event that I don't (or my gf doesn't) turn off BOINC before closing the laptop and a mistake occurs, shouldn't it go back to the last checkpoint?

Also, I found that this error -226 means that the project failed too many times before reaching its next checkpoint. But I also saw that the log message "Result '(result)' exited with zero status but no 'finished' file" can mean that the task exited because of a lack of heartbeat. In fact, in the stderr of those two tasks I listed, there appears to be a heartbeat error.

If there is no heartbeat from BOINC, shouldn't the project then shut down and not restart until instructed to do so by the client, thus preventing -226 from happening? (Unless, perhaps, the client is sending messages to restart the project but then failing to send a heartbeat.)

So were two forms of built-in backup/error-correction bypassed here?
ID: 44097
old_user677346

Joined: 16 Apr 12
Posts: 6
Credit: 19,102
RAC: 0
Message 44098 - Posted: 26 Apr 2012, 18:59:53 UTC - in response to Message 44097.  

From the log, at the start of when the shit hit the fan:


Thu Apr 26 00:35:07 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:35:07 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:35:40 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:35:40 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:35:40 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:35:41 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:35:41 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:36:03 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 00:36:10 2012 | | Project communication failed: attempting access to reference site
Thu Apr 26 00:36:32 2012 | | BOINC can't access Internet - check network connection or proxy configuration.
Thu Apr 26 00:37:24 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:37:24 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:37:56 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:37:56 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:37:56 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4w_1983_1_007948687_0 using hadam3p_pnw version 609 in slot 1
Thu Apr 26 00:37:58 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:37:58 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:37:58 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:38:19 2012 | climateprediction.net | Scheduler request failed: Couldn't resolve host name
Thu Apr 26 00:39:34 2012 | climateprediction.net | Sending scheduler request: To fetch work.
Thu Apr 26 00:39:34 2012 | climateprediction.net | Requesting new tasks for NVIDIA
Thu Apr 26 00:40:06 2012 | climateprediction.net | Task hadam3p_pnw_cc4v_1982_1_007948686_0 exited with zero status but no 'finished' file
Thu Apr 26 00:40:06 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.
Thu Apr 26 00:40:06 2012 | climateprediction.net | Restarting task hadam3p_pnw_cc4v_1982_1_007948686_0 using hadam3p_pnw version 609 in slot 0
Thu Apr 26 00:40:08 2012 | climateprediction.net | Task hadam3p_pnw_cc4w_1983_1_007948687_0 exited with zero status but no 'finished' file
Thu Apr 26 00:40:08 2012 | climateprediction.net | If this happens repeatedly you may need to reset the project.



"Scheduler request failed: Couldn't resolve host name" appears sandwiched between all of the "task exited" errors between that time and when the project finally died. Alternative hypothesis: crash caused by a lack of internet connectivity?

I live in an area of limited internet connection. I don't have dial-up, but the broadband internet connection itself comes and goes on a whim (developing-world connectivity). Two questions, if we are to accept this alternative hypothesis:
1) Is it really possible that a bad internet connection will crash a project?
2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by "only when I specify", I don't mean just at certain times of day: I mean only when I click "update."
ID: 44098
Lockleys

Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 44099 - Posted: 26 Apr 2012, 20:01:13 UTC - in response to Message 44098.  

old_user677346 wrote: 2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by "only when I specify", I don't mean just at certain times of day: I mean only when I click "update."

Sure; I run like that all the time. In BOINC Manager, select Activity and then select NETWORK ACTIVITY SUSPENDED. That will do what it says on the tin. Then you can choose when to return it to NETWORK ACTIVITY ALWAYS AVAILABLE in order to permit any updates, trickles, uploads, downloads to complete and then go back to SUSPENDED till the next time.
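
The same switch can also be flipped from a script through BOINC's command-line tool, `boinccmd`, using its `--set_network_mode` option. The wrapper below is only a sketch: it assumes `boinccmd` is on your PATH and is allowed to talk to the running client (on some installs it has to be run from the BOINC data directory), and the Python function names are invented for illustration.

```python
import subprocess

# Modes accepted by `boinccmd --set_network_mode`.
VALID_MODES = ("always", "auto", "never")


def network_mode_command(mode, boinccmd="boinccmd"):
    """Build the argument list that asks the BOINC client to change network mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [boinccmd, "--set_network_mode", mode]


def set_network_mode(mode, boinccmd="boinccmd"):
    """Run boinccmd; raises CalledProcessError if the command reports failure."""
    subprocess.run(network_mode_command(mode, boinccmd), check=True)


# Example (not executed here, since it needs a running BOINC client):
#   set_network_mode("never")   # same idea as NETWORK ACTIVITY SUSPENDED
#   set_network_mode("always")  # same idea as NETWORK ACTIVITY ALWAYS AVAILABLE
print(" ".join(network_mode_command("never")))
```

Keeping the command construction separate from the call that runs it makes it easy to check what would be executed before touching the client.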
ID: 44099
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44101 - Posted: 26 Apr 2012, 21:32:55 UTC

old_user677346 wrote: Fair enough, but in the event that I don't (or my gf doesn't) turn off BOINC before closing the laptop and a mistake occurs, shouldn't it go back to the last checkpoint?

The programs used here have a large number of files open, many of which need to be saved, or checkpointed, as a group. If the computer shuts down while the program is in the middle of saving these files, the checkpoint will be corrupted, as some files will be out of sync with the others.
And so the program crashes.
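
To make the "saved as a group" point concrete, here is a minimal sketch of one general technique for writing a set of files so that a reader sees either all of them or none. It is NOT how the climate models themselves checkpoint (their internals aren't described in this thread); the function names and the `ckpt-` directory naming are assumptions made up for illustration.

```python
import os
import tempfile


def save_checkpoint(base_dir, step, files):
    """Write all files for one checkpoint, then publish them with one rename.

    `files` maps file name -> bytes. Readers only look at directories named
    'ckpt-<step>', and such a directory appears only after every file in the
    group has been written.
    """
    tmp_dir = tempfile.mkdtemp(prefix="ckpt-tmp-", dir=base_dir)
    for name, data in files.items():
        with open(os.path.join(tmp_dir, name), "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before publishing
    final_dir = os.path.join(base_dir, f"ckpt-{step:08d}")
    os.rename(tmp_dir, final_dir)  # the single step that makes the group visible
    return final_dir


def latest_checkpoint(base_dir):
    """Return the newest complete checkpoint directory, or None if there is none."""
    complete = sorted(
        name
        for name in os.listdir(base_dir)
        if name.startswith("ckpt-") and not name.startswith("ckpt-tmp-")
    )
    return os.path.join(base_dir, complete[-1]) if complete else None
```

If the machine dies halfway through a save, only a leftover `ckpt-tmp-...` directory remains, and `latest_checkpoint` still returns the previous consistent group. A program that instead overwrites its files in place has no such fallback, which is the failure described above.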

This has been posted many times over the years, as has the advice to make regular backups in case of power failure. (Which includes hibernating the computer.)
There's even a section on backups in the README posts section of our other board, and a link to it in my sig.

old_user677346 wrote: "Result '(result)' exited with zero status but no 'finished' file" can mean that the task exited because of a lack of heartbeat.

...

If there is no heartbeat from BOINC, shouldn't the project then shut down and not restart until instructed to do so by the client


BOINC is in 2 parts:
1) The "client", which is the part that does the work and is invisible to users,
2) The Manager/GUI, whose job is to look at what the client is doing and display this to the user.

These 2 parts communicate with each other 10 times per second, and this is referred to as the "heartbeat".

I also notice a lot of "Suspend request from BOINC" entries in the error messages. This is most likely because you have the default setting for Suspend work if CPU usage is above , and it will result in the program stopping and starting a lot. These programs are from the UK Met Office, where they run on supercomputers. They just aren't designed for this constant interruption.

old_user677346 wrote: If there is no heartbeat from BOINC, shouldn't the project then shut down

As mentioned above, 'heartbeat' is a BOINC function. And BOINC doesn't stop the science applications when it's having problems. It's just a 'traffic router', deciding which project's applications should get run next on multi-project computers.

The reason that you haven't received credit for a lot of your models is that credit is based on trickle_up files received by the project's servers, and a lot of your models have crashed before they got anywhere near creating trickles. (Trickles contain small amounts of data for the project; larger chunks are contained in zip files.)




Backups: Here
ID: 44101
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 44102 - Posted: 26 Apr 2012, 21:47:14 UTC - in response to Message 44099.  

old_user677346 wrote: 2) Is there any way to have BOINC use the internet ONLY when I specify, even though I'm (nominally) on broadband? And by "only when I specify", I don't mean just at certain times of day: I mean only when I click "update."

Lockleys wrote: Sure; I run like that all the time. In BOINC Manager, select Activity and then select NETWORK ACTIVITY SUSPENDED. That will do what it says on the tin. Then you can choose when to return it to NETWORK ACTIVITY ALWAYS AVAILABLE in order to permit any updates, trickles, uploads, downloads to complete and then go back to SUSPENDED till the next time.

There are a number of situations where that kind of approach is advisable, and not only where the Internet connection is unreliable. For example, the only reliable method I have found of finishing HADCM3N models is to let them run without interruption for the whole 40 years, so I make a backup before the model starts, run the model with network activity off, then turn network activity on again only when the model has finished and upload with four zips safely generated. 100% zip rate so far with that protocol, which is a distinct improvement on my previous record.
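
For the "make a backup before the model starts" step, something along these lines would do as a sketch. It assumes BOINC has been fully exited first (copying while models are writing risks an inconsistent copy), and the function name and backup location are hypothetical; `shutil.copytree` keeps permission bits and timestamps but not file ownership, which may matter if the copy is ever restored.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir, backup_root):
    """Copy the whole BOINC data directory into a new time-stamped folder."""
    data_dir = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{data_dir.name}-{stamp}"
    # symlinks=True copies links as links rather than following them.
    shutil.copytree(data_dir, target, symlinks=True)
    return target


# Example call (paths are illustrative; exit BOINC before running it):
#   backup_boinc_data("/Library/Application Support/BOINC Data", "/Volumes/Backup")
```

Each call creates a fresh folder, so older backups are never overwritten and can be pruned by hand.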

However, an intermittent Internet connection won't crash a running model. For Mac users who upgrade BOINC it is advisable to reset the project before running a new model. One, at least, of Philip's models had an 'execl' error ...

execl(/Library/Application Support/BOINC Data/projects/climateprediction.net/hadam3p_pnw_um_6.09_i686-apple-darwin, 181970) failed!

... which is usually attributed to a permissions problem caused by the upgrade (or a restore from backup). There is a new version of the Mac application in the pipeline that fixes the permissions problem, possibly waiting for a new region to be added to PNW, EU and SAF before being released ...
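
A quick way to check for that kind of permissions problem is to look at whether the application file named in the `execl` message exists and is executable. The sketch below only tests access for the user running the script, which may not be the account BOINC uses to run tasks, so treat the result as a hint rather than proof; the function name is invented for illustration.

```python
import os
import stat


def describe_executable(path):
    """Report whether `path` exists and whether the current user may execute it."""
    if not os.path.isfile(path):
        return "missing"
    mode = stat.filemode(os.stat(path).st_mode)  # e.g. '-rwxr-xr-x'
    if os.access(path, os.X_OK):
        return f"executable ({mode})"
    return f"not executable ({mode})"


# Example call, using the path from the error message above:
#   describe_executable(
#       "/Library/Application Support/BOINC Data/projects/"
#       "climateprediction.net/hadam3p_pnw_um_6.09_i686-apple-darwin"
#   )
```

A result of "not executable" or "missing" would be consistent with the permissions explanation; resetting the project, as suggested above, is the remedy mentioned in this thread.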
ID: 44102
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44103 - Posted: 26 Apr 2012, 22:26:14 UTC - in response to Message 44098.  

To connect only when you want to, you can click on Activity and then suspend network activity. Then you can just resume network activity when you want it to talk to the project.
ID: 44103
old_user677346

Joined: 16 Apr 12
Posts: 6
Credit: 19,102
RAC: 0
Message 44266 - Posted: 29 May 2012, 21:06:06 UTC

Huzzah! My computer has successfully completed its first two work units. I've had no problems since limiting network access to just 30 minutes a day, during a time period that I know my connection is good.
ID: 44266

©2024 cpdn.org