Thread 'Error while computing'

ChrisD

Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53166 - Posted: 23 Dec 2015, 16:04:20 UTC
Last modified: 23 Dec 2015, 16:41:13 UTC

Just checked my reported tasks, and once again I find a task that has crashed.

Workunit# 10238153

The report is "Error while computing", but all 46,379 time steps seem to have been reported.

???

Error msg: The system cannot find the drive specified.
(0xf) - exit code 15 (0xf)

What drive? All 7 CPDN tasks are running, and they have all checkpointed OK within the last 5 minutes, so they can certainly find 'the' drive.

To me it seems that BOINC erases files too fast, that is, before CPDN has finished its checks.

Another task trashed, thanks to BOINC and CPDN not syncing correctly.

ChrisD

EDIT:

Checked the log. All checkpoints are reported in it; the only difference from an OK task is that the last two .zips are reported missing.

Earlier I noticed BOINC deleting files even when all tasks were suspended.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 53167 - Posted: 23 Dec 2015, 17:05:43 UTC - in response to Message 53166.  

Not sure without checking if this is relevant, but some of the WAH2 tasks were expected to fall over after completing because the wrong task length was specified. Not sure whether this makes them think there are missing zips, though.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53168 - Posted: 23 Dec 2015, 19:25:41 UTC - in response to Message 53166.  

The system cannot find the drive specified.

What drive?

YOUR hard drive.
Most likely because at that instant there was a flurry of activity from several models, all wanting to read or write files on the HD. When you have a 12-core processor, you need a very fast disk.

********************

One of the annoyances of BOINC is that if a model crashes after creating a zip but before that zip can be uploaded, then when BOINC gets around to informing the server about the crash, the zip will start to upload and then get deleted by BOINC as it gets to that part of the upload. Without a re-write of the code to make it more specific to CPDN, this is something that we just have to put up with. But at least the trickle_up files get through, so we get the credit. Touch wood.


Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53169 - Posted: 23 Dec 2015, 19:30:18 UTC - in response to Message 53167.  

Hi Dave

I'm not sure either, but I think that the problem that you mention is more likely to be this error message:
finish file present too long
as seen in this failed model.

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53172 - Posted: 24 Dec 2015, 4:36:41 UTC
Last modified: 24 Dec 2015, 4:45:35 UTC

Thanks for the replies :)

Through SETI Beta I ended up reading about a BOINC bug that causes BOINC to choke when it cannot find the DNS server, thus trashing the tasks that were trying to upload files.

If BOINC cannot fix this, maybe CPDN could be made a bit more 'BOINC-safe'?

Suggestion: a small mod to the exception handler. On a 'file not found' error, wait 2 seconds and retry, say twice, before really giving up. (I know that back in the DOS days several retries were made before an error was reported, and this is a long way from DOS, but maybe it will help anyway.)
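To make the suggestion concrete, here is a minimal sketch of the retry idea in Python. It is only an illustration: the real CPDN application is not written this way as far as I know, and the function name `open_with_retry` and its parameters are made up for the example.

```python
import time


def open_with_retry(path, mode="rb", retries=2, delay_s=2.0, sleep=time.sleep):
    """Open `path`, retrying on 'file not found' before really giving up.

    `retries` is the number of extra attempts made after the first failure.
    `sleep` is injectable so the wait can be skipped in tests (hypothetical
    parameter, not part of any BOINC or CPDN API).
    """
    attempt = 0
    while True:
        try:
            return open(path, mode)
        except FileNotFoundError:
            if attempt >= retries:
                raise  # all retries used up: report the error as before
            attempt += 1
            sleep(delay_s)  # give the disk (or BOINC) a moment, then retry
```

With the defaults this makes three attempts in total, about 4 seconds apart at most, so a file that is merely late in appearing would no longer kill the task, while a file that is truly gone still raises the original error.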

How about letting the CPDN task run for a few seconds after it has created the final .zips (a couple of dummy time loops will do), thus preventing BOINC from prematurely reporting 'CPDN Task not running'?

Well, it's a comfort that the trickles, at least, get through.
(Is there a way to make the BOINC Event Log tell what task has requested a trickle-up?)

ChrisD

p.s.

Sorry Les, this was just for fun.

"The system cannot find the drive specified.

What drive?"

I was just trying to say that such an error message is no good.
At least it should say which drive, and maybe even the name of the file the program is trying to access.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53173 - Posted: 24 Dec 2015, 4:54:50 UTC - in response to Message 53172.  

How about letting the CPDN task run for a few seconds after having created the final .zips,

As far as I know, all of the models run for 2-3 model days after the last zip point.


ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53348 - Posted: 27 Jan 2016, 14:28:58 UTC - in response to Message 53173.  

Just bought a fine new Samsung SSD 850 Pro 256 GB.
This drive has been assigned solely to BOINC and its projects, that is, CPDN for CPU and SETI Beta for GPU.
Just uploaded 7 _2 zips, so thus far the new CPDN tasks are behaving nicely. :)

Fingers Crossed.

ChrisD


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53349 - Posted: 27 Jan 2016, 14:42:46 UTC - in response to Message 53348.  

If you install the Samsung Magician utility (4.9.5 is the latest), you can gain a free 1 GB cache by enabling Rapid Mode. That will protect the SSD and reduce the chance for errors even more.

Billy Ewell 1931
Joined: 14 Aug 06
Posts: 22
Credit: 6,493,566
RAC: 12,967
Message 53421 - Posted: 11 Feb 2016, 20:11:25 UTC

One of my multi-core computers failed and will not be repaired. The following task IDs are provided in case you wish to retransmit the tasks now, as the four have a reporting date of Dec. 2016. The already granted credits vary from 3987.46 to 4981.10, so they are well on the way to completion.

All are HadAM3P-HadRM3P
19195435
19195475
19195706
19195771

Hopefully this will be beneficial.

Austin, Texas USA


Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53422 - Posted: 11 Feb 2016, 20:53:24 UTC - in response to Message 53421.  

Sorry Billy, but the system is automated.
The only way for tasks to be re-issued early is when they're Aborted, which is not possible in your case.
This loss due to hardware failure happens every now and then.


Billy Ewell 1931
Joined: 14 Aug 06
Posts: 22
Credit: 6,493,566
RAC: 12,967
Message 53423 - Posted: 12 Feb 2016, 5:18:01 UTC - in response to Message 53422.  

Thanks Les!!!

ian
Joined: 23 Feb 06
Posts: 20
Credit: 9,212,980
RAC: 4,904
Message 53450 - Posted: 16 Feb 2016, 18:20:21 UTC

I am getting many "computation errors" whilst running WAH 7.08, recently on probably 30% of the WUs being processed.

Running Windows 10, fully updated as of 16th Feb 2016. Intel i7 4790, 64-bit.

It's getting beyond annoying. Any suggestions? (I am not a computer expert by any means, so I won't understand any jargon.)

Ian

jrapdx
Joined: 4 Jul 15
Posts: 63
Credit: 3,223,760
RAC: 0
Message 53451 - Posted: 16 Feb 2016, 19:35:58 UTC - in response to Message 53450.  
Last modified: 16 Feb 2016, 19:38:58 UTC

Sounds similar to problems I was having. See the thread in the Unix/Linux section where the problem was discussed; apparently a "bad batch" of tasks is implicated. We're told the server has been in maintenance mode to allow technicians to weed out the error-producing tasks. I'm not sure if that work has been completed.

In my case tasks were crashing immediately after download. I've temporarily set "no new tasks" to prevent this thrashing. When the problem is resolved I will reset it.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53455 - Posted: 16 Feb 2016, 22:03:37 UTC - in response to Message 53450.  

Ian

Your computers are hidden, so some general thoughts.

Describing the models as WAH 7.08 is no longer useful, as there are numerous different batches of models these days, all of which will have different properties and different potential problems.

There's a batch number now included in the model name, so you should use that. For example, if the failures span a range of batch numbers, then the cause is most likely your computer. (One of mine: wah2_eu25_n15i_203112_12_333_010295219_0)
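For anyone who wants to tally failures by batch, a rough Python sketch follows. It assumes the batch number is the sixth underscore-separated field of the task name (333 in the example name); that position is my reading of the names quoted in this thread, not something confirmed by the project, and `batch_of` and `batches_in` are made-up helper names.

```python
from collections import Counter


def batch_of(task_name: str) -> str:
    """Return the batch field of a task name.

    ASSUMPTION: the batch number is the 6th underscore-separated field,
    e.g. '333' in wah2_eu25_n15i_203112_12_333_010295219_0.
    """
    fields = task_name.split("_")
    if len(fields) < 8:
        raise ValueError(f"unexpected task name: {task_name!r}")
    return fields[5]


def batches_in(failed_task_names):
    """Count failed tasks per batch number."""
    return Counter(batch_of(name) for name in failed_task_names)
```

Feeding `batches_in` the names of the failed tasks gives a count per batch. Failures piled up in one batch point at that batch; failures spread over many batches point at the computer.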

The specific error message for each task is in the Stderr list on that task's web page.

I'm becoming increasingly suspicious of Windows 10. It may not be the OS per se, but how Microsoft is using their new update system to control the computers running it.


Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53458 - Posted: 17 Feb 2016, 2:34:40 UTC
Last modified: 17 Feb 2016, 2:37:05 UTC

Although probably not related to your errors, I found a new source of error today that I had not known about. When I rebooted one of my PCs, I lost a couple of the long-running WAH2s:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19261127
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19264442

It was annoying, since they had run for over 5 days, and curiously several others that had been running just as long on the same PC did not error. But then I found that I was using about 2 GB of virtual memory on my Win7 64-bit Haswell machine, and it made some sense. Without enough working memory, some parts of the work units spill over into virtual memory. That seems OK for a while as long as they are running, but they do not tolerate a reboot. As a result, I have increased the amount of working memory and will try it again.

(But then the machine crashed for an unrelated reason, so it will be a while.)

gmlew
Joined: 4 Feb 16
Posts: 2
Credit: 204,365
RAC: 0
Message 53493 - Posted: 21 Feb 2016, 23:57:34 UTC

I too am getting "Error while computing" on quite a few (maybe all) WUs. Computer ID: 1389232. These errors seem to happen following a suspend/resume of BOINC or the project (with tasks left in memory) or a machine restart.
Can I do anything to fix this? How can I help? Are these actually problems?
From reading a few of the other forum posts I gathered that some issues may be caused by BOINC itself. Would it help to just run a Linux VM, or is that even possible with these apps?

Thanks, GREG L.

#This is a bit of the log with 4 WU reporting about the same from **4.zip up to **12.zip or **13.zip
#
2/21/2016 11:37:39 AM | climateprediction.net | Message from task: 0
2/21/2016 11:37:39 AM | climateprediction.net | Computation for task wah2_eu25_n80f_203812_12_333_010297101_0 finished
2/21/2016 11:37:39 AM | climateprediction.net | Output file wah2_eu25_n80f_203812_12_333_010297101_0_4.zip for task wah2_eu25_n80f_203812_12_333_010297101_0 absent
# also including these WU
#
2/21/2016 11:37:44 AM | climateprediction.net | Computation for task hadam3p_afr50_bqep_201412_12_347_010322178_0 finished
2/21/2016 11:37:42 AM | climateprediction.net | Computation for task hadam3p_afr50_bpuo_201412_12_347_010321457_0 finished
2/21/2016 11:37:39 AM | climateprediction.net | Computation for task wah2_eu25_n81m_203812_12_333_010297144_1 finished
#with some more errors like this
#
2/21/2016 11:42:05 AM |  | Project communication failed: attempting access to reference site
2/21/2016 11:42:05 AM | climateprediction.net | Temporarily failed upload of hadam3p_afr50_bpuo_201412_12_347_010321457_0_13.zip: transient HTTP error
2/21/2016 11:42:05 AM | climateprediction.net | Backing off 00:03:58 on upload of hadam3p_afr50_bpuo_201412_12_347_010321457_0_13.zip
2/21/2016 11:42:07 AM |  | Internet access OK - project servers may be temporarily down.
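In case it helps anyone sift a long Event Log, here is a small Python sketch that pulls out the two kinds of lines shown above: "Output file ... absent" and "Temporarily failed upload of ...". It assumes the `timestamp | project | message` layout of the excerpt and skips the `#` lines, which are my own notes; the function name `summarize` is just for this example and the message patterns are taken only from the lines quoted here.

```python
import re
from collections import defaultdict

# Patterns copied from the wording of the log lines quoted in this post.
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")
UPLOAD_FAILED = re.compile(r"Temporarily failed upload of (\S+): (.+)")


def summarize(log_text):
    """Return (absent files per task, list of failed uploads)."""
    absent = defaultdict(list)  # task name -> missing output files
    failed_uploads = []         # (file name, reason) pairs
    for line in log_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # poster's annotation, not a log line
        fields = [field.strip() for field in line.split("|", 2)]
        if len(fields) != 3:
            continue  # not in 'timestamp | project | message' form
        message = fields[2]
        match = ABSENT.search(message)
        if match:
            absent[match.group(2)].append(match.group(1))
            continue
        match = UPLOAD_FAILED.search(message)
        if match:
            failed_uploads.append((match.group(1), match.group(2)))
    return dict(absent), failed_uploads
```

Pasting the copied log text into `summarize` gives, per task, the list of zips reported absent, plus the uploads that failed and why, which is easier to compare against the task's web page than the raw log.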

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53494 - Posted: 22 Feb 2016, 0:41:32 UTC

Computing errors are mostly to do with how the computer is used.

e.g. NEVER allow an anti-virus program to scan the 2 BOINC sections on your computer.
ALWAYS Exit from BOINC before shutting down the computer. (Preferably, also Suspend BOINC, then Exit.)

And there are the other items in my post here that may help.

And about the missing zips: here, in the middle of a thread.

gmlew
Joined: 4 Feb 16
Posts: 2
Credit: 204,365
RAC: 0
Message 53495 - Posted: 22 Feb 2016, 5:45:16 UTC

Thank You Les

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 53496 - Posted: 22 Feb 2016, 9:21:04 UTC

Would it help to just run a Linux VM or is that even possible with these apps?


There has been a distinct lack of tasks for Linux over the past few weeks, and my understanding is that gaps like this may occur more often than not. With that in mind, I and a few others have switched to running Windows tasks under Wine, which, although based on a very small sample, is to date giving me a 100% record, including two tasks that had already failed on two other machines. One of these is going much more slowly than those from the other ANZ batch I ran, though it should finish in a few hours' time.

jrapdx
Joined: 4 Jul 15
Posts: 63
Credit: 3,223,760
RAC: 0
Message 53497 - Posted: 22 Feb 2016, 20:16:51 UTC - in response to Message 53496.  

You inspired me to do the same, and I've been running BOINC/CPDN tasks for a few weeks now. One set of tasks (wah2) finished, and I have another progressing with the shorter tasks nearing completion, demonstrating that it can work well.

However, I think it's worth pointing out that BOINC/CPDN under Wine is not all a bed of roses. I've experienced numerous "error while computing" task failures, some of which are likely attributable to Wine-related interruptions. Wine itself can be tricky to set up; I am still working on getting boincmgr.exe to start correctly when the computer unexpectedly reboots (as we are subject to random power failures here).

I found it was necessary to use the most recent development Wine versions; the earlier releases didn't work on my system. (Using Ubuntu 15.04/15.10 on stock hardware.) Wine 1.9.4 was just announced; the latest in the Ubuntu PPA is 1.9.3.
When I nail down the magic recipe for keeping all the plates spinning, I'll post the information.
