Thread 'Domino effect leads to unrecoverable errors'

Marc

Joined: 12 Oct 05
Posts: 7
Credit: 65,046
RAC: 0
Message 18951 - Posted: 3 Jan 2006, 17:13:51 UTC
Last modified: 3 Jan 2006, 17:22:24 UTC

Yes, yes, I know unrecoverable errors happen with climateprediction.net, and that sometimes they're hardware-specific, sometimes workload-specific, and occasionally caused by a bug in the particular BOINC science project being run.

...but I think in this case you will agree that tightening the climateprediction.net code against other applications' errors would lessen the frequency of the error I saw.

In my case, because I don't know BOINC very well, I ended up with three climateprediction.net workloads, two of which had been deferred. I had just got into the office and unlocked my screen (I leave Outlook running 24/7 so that it can process client-side e-mail rules) when I found that Outlook had basically "zombie"'d, eating up all the CPU resources on my Windows XP Pro commercial desktop. Thinking it might just be redrawing or reloading content from the Exchange server, I left to run errands for 20-30 minutes. When I came back, I found this in my BOINC (5.2.13) log:

1/3/2006 8:25:20 AM|climateprediction.net|Unrecoverable error for result 15mu_300074520_0 (There are no child processes to wait for. (0x80) - exit code 128 (0x80))
1/3/2006 8:25:20 AM||request_reschedule_cpus: process exited
1/3/2006 8:25:21 AM|climateprediction.net|Computation for result 15mu_300074520_0 finished
1/3/2006 8:25:22 AM|climateprediction.net|Restarting result 1dht_100084807_0 using hadsm3 version 413

At this point I terminated Outlook (which freed up enough CPU resources that BOINC could resume). I detached the SETI@home project, which I wasn't using anyway, and looked back at the log. The second of the three workloads had also crashed:

1/3/2006 8:53:27 AM|SETI@home|Resetting project
1/3/2006 8:53:27 AM||request_reschedule_cpus: exit_tasks
1/3/2006 8:53:28 AM|SETI@home|Detaching from project
1/3/2006 8:53:28 AM||request_reschedule_cpus: project op
1/3/2006 8:53:55 AM|climateprediction.net|Unrecoverable error for result 1dht_100084807_0 ( - exit code -5 (0xfffffffb))
1/3/2006 8:53:55 AM||request_reschedule_cpus: process exited
1/3/2006 8:53:55 AM|climateprediction.net|Computation for result 1dht_100084807_0 finished
1/3/2006 8:53:55 AM|climateprediction.net|Restarting result 1eoq_100086367_0 using hadsm3 version 413
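
A note on the hexadecimal values in the two log excerpts above: the number in parentheses appears to be the exit code reinterpreted as an unsigned 32-bit value, which would explain why -5 is shown as 0xfffffffb while 128 is shown as 0x80. The Python sketch below illustrates that conversion. It assumes 32-bit two's-complement exit codes, and the helper names `exit_code_hex` and `signed_from_hex` are made up for illustration; they are not part of BOINC.

```python
def exit_code_hex(code: int) -> str:
    """Format a signed 32-bit exit code as unsigned hex (assumed log format)."""
    # Masking with 0xFFFFFFFF reinterprets a negative value as unsigned 32-bit.
    return hex(code & 0xFFFFFFFF)


def signed_from_hex(text: str) -> int:
    """Recover the signed 32-bit value from a hex string such as '0xfffffffb'."""
    value = int(text, 16)
    # Values with the top bit set represent negative numbers in two's complement.
    return value - (1 << 32) if value >= (1 << 31) else value


print(exit_code_hex(128))             # 0x80
print(exit_code_hex(-5))              # 0xfffffffb
print(signed_from_hex("0xfffffffb"))  # -5
```

Read this way, the two failures carry different codes (128 and -5), so they may not share a single cause even though both followed the same CPU starvation.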

At this point I started reading the BOINC Wiki about unrecoverable computation errors and found a note that I should report this to the particular project affected.

Wondering how I was going to clear these errors and whether those two crashed workloads would need to be manually reported before they could be removed, I decided to try automatic handling and request a server update. The update seems to have worked to report the completed segments at least.

Still, if this is indeed a resource starvation problem like I think it is ("There are no child processes to wait for"), the client should have waited for resources to be free, not errored out. Right?
Honza
Volunteer moderator

Joined: 5 Aug 04
Posts: 390
Credit: 2,475,242
RAC: 0
Message 18954 - Posted: 3 Jan 2006, 17:59:16 UTC - in response to Message 18951.  

Still, if this is indeed a resource starvation problem like I think it is ("There are no child processes to wait for"), the client should have waited for resources to be free, not errored out. Right?
Problem well described.
At what priority is Outlook running? Lucky me: I have never used Outlook on my own system, but I have seen a lot of problems with it on office computers, eating up all CPU resources being a common one.

How long should the BOINC core client wait for child processes (the CPDN application)? If 30 minutes is not enough, should it wait even longer in every case (e.g. Windows shutdown/restart)? I definitely would not want to wait half an hour for Windows to restart. There must be a timeout limit... which was probably exceeded due to Outlook's resource demands.

The other problem on Windows on a single-CPU/single-core machine is that when one high-priority process (application) demands all available CPU resources, there are no resources left to manage the problem.
Possible solutions: (i) put both resource-hungry applications at equal priority so they can share resources; (ii) have more CPU units (a dual-core or dual-CPU machine) so that each unit can handle one of the demanding applications (beware of process CPU affinity); (iii) a solution at the OS level.

I would first check Outlook's priority and do some Outlook maintenance (compact the database, defragment the disk, since such applications tend to fragment large files, which slows them down), etc.
What AV solution are you using? There may be a connection with the e-mail client.
Marc

Joined: 12 Oct 05
Posts: 7
Credit: 65,046
RAC: 0
Message 19064 - Posted: 5 Jan 2006, 23:55:57 UTC - in response to Message 18954.  

At what priority is Outlook running?


"Normal" priority.

How long should the BOINC core client wait for child processes (the CPDN application)? If 30 minutes is not enough, should it wait even longer in every case (e.g. Windows shutdown/restart)? I definitely would not want to wait half an hour for Windows to restart. There must be a timeout limit... which was probably exceeded due to Outlook's resource demands.


That's tricky, and I understand why 30 minutes seems like a long time. ...but for someone who leaves their workstation on overnight and over weekends, and is processing a workload that can easily take 3 months to complete, even 72 hours is not all that long. Even so, a timeout should not mean "throw away the whole workload and start over"; it should just mean going back to the last checkpoint, right?
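
The "time out, then fall back to the last checkpoint" idea can be sketched in Python as follows. This is only an illustration of the suggestion, not how the BOINC client actually supervises its science applications. It assumes a worker program that resumes from its own last checkpoint whenever it is started, and `run_with_rollback` and its parameters are hypothetical names.

```python
import subprocess
import sys


def run_with_rollback(cmd, timeout_s, max_restarts=3):
    """Run cmd; on timeout, restart it instead of failing the whole job.

    Assumption: the worker resumes from its last checkpoint on start-up.
    Returns (exit_code, restarts_used); exit_code is None if every
    attempt timed out.
    """
    for attempt in range(max_restarts + 1):
        try:
            done = subprocess.run(cmd, timeout=timeout_s)
            return done.returncode, attempt
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed and reaped the child here,
            # so the next loop iteration simply starts the worker again.
            continue
    return None, max_restarts


# Usage: a trivial stand-in worker that finishes immediately.
code, restarts = run_with_rollback([sys.executable, "-c", "print('worker ran')"], 30)
print(code, restarts)
```

The design choice is that a timeout costs at most the work done since the last checkpoint, and only repeated timeouts lead to giving up on the workload.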

(iii) solution on OS level.

I think you're hinting at something I would agree with -- just because a process has slightly lower priority doesn't mean it should come to a screeching halt. Ideally Windows should probably execute tasks in a manner more like Linux. I don't know how anyone would change that.

I would first check Outlook's priority and do some Outlook maintenance (compact the database, defragment the disk, since such applications tend to fragment large files, which slows them down), etc.
What AV solution are you using? There may be a connection with the e-mail client.


I have Symantec Anti-Virus Corporate Edition running, and despite my IT department warning me not to install "unauthorized patches" I've gone ahead and run Office Update to get everything up to where Microsoft thinks it should be. Though Outlook seems stable at the moment, errors have increased, especially errors like this:
1/4/2006 4:25:13 PM|climateprediction.net|Unrecoverable error for result sulphur_igu2_000861626_0 (<file_xfer_error> <file_name>sulphur_igu2_000861626_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_igu2_000861626_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_igu2_000861626_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_igu2_000861626_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_igu2_000861626_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)

Honza
Volunteer moderator

Joined: 5 Aug 04
Posts: 390
Credit: 2,475,242
RAC: 0
Message 19074 - Posted: 6 Jan 2006, 9:46:12 UTC

Marc, error -161 is a known problem caused by a buggy WU batch
http://www.climateprediction.net/board/viewtopic.php?p=32429#32429

Checkpoints on CPDN are saved every 144 timesteps (3 model days).
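
To put that checkpoint interval in perspective, here is a small worked example in Python. The two constants come from the sentence above; everything else is derived arithmetic. It assumes checkpoints fall on exact multiples of 144 counted from timestep 0, which the post does not state, and the function names are illustrative only.

```python
TIMESTEPS_PER_CHECKPOINT = 144  # from the post
MODEL_DAYS_PER_CHECKPOINT = 3   # from the post

# 144 timesteps over 3 model days gives 48 timesteps per model day,
# i.e. one timestep covers 30 minutes of model time.
timesteps_per_model_day = TIMESTEPS_PER_CHECKPOINT // MODEL_DAYS_PER_CHECKPOINT
model_minutes_per_timestep = 24 * 60 / timesteps_per_model_day


def last_checkpoint(timestep: int) -> int:
    """Most recent checkpointed timestep (assumes multiples of 144 from 0)."""
    return (timestep // TIMESTEPS_PER_CHECKPOINT) * TIMESTEPS_PER_CHECKPOINT


def timesteps_lost(timestep: int) -> int:
    """Timesteps that must be recomputed after rolling back from `timestep`."""
    return timestep - last_checkpoint(timestep)


print(timesteps_per_model_day)     # 48
print(model_minutes_per_timestep)  # 30.0
print(last_checkpoint(1000))       # 864
print(timesteps_lost(1000))        # 136
```

Under these assumptions a rollback never costs more than 143 timesteps, just under three model days of computation.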

I have never seen CPDN "come to a screeching halt", but I have seen Outlook do this several times (sic).

I'm running boxes on a 24/7 basis and they are not CPDN-exclusive: some are servers (one an e-mail/file server, another for accounting), and my main box runs Photoshop, Quark and other 2D/3D graphics applications, printing hundreds of pages and PDF files of 100+ MB. I have never had CPDN halt the screen, keyboard or anything else.

True, I don't run Outlook here (though the office machine does) and don't run AV from Symantec (all boxes are NOD-powered), so I can't say much about Outlook/Symantec vs CPDN.

I would exclude \BOINC from AV scanning/shield.
