Thread 'Every last HADAM3P European Region ends in computation error'

[boinc.at] Nowi

Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 43678 - Posted: 18 Jan 2012, 19:39:39 UTC

Hello,

Over the last few days I have encountered a lot of computation errors on HADAM3P European Region models. I don't think the problem is on my side, because my wingmen also got computation errors. Please look here for an example: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=7791693

Is there a bad batch of models running?

Do you have any suggestions?

Thanks

Nowi
ID: 43678
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 43679 - Posted: 18 Jan 2012, 19:46:03 UTC - in response to Message 43678.  

I know there is another thread that says something similar and it has been reported to the relevant team by one of the admins.

Dave
ID: 43679
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43680 - Posted: 18 Jan 2012, 21:22:34 UTC - in response to Message 43678.  

Nowi

On the list that you provided:
The first computer is failing because it's a Mac that's been upgraded without detaching/reattaching as per the sticky in the Mac section. (I'm about to report this.)
The last one seems to be failing because of a computer problem.

Which leaves yours. And I think that it's possible that it's your computer rather than the models.

The most recent problem with these models was with download errors, not computation errors. And this has been fixed.


Backups: Here
ID: 43680
[boinc.at] Nowi

Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 43681 - Posted: 18 Jan 2012, 21:55:24 UTC

Thanks Les!

Of course it is possible that the problem is my computer, especially given your extra information. I returned my last good result on 15 January; after that every WU failed. I changed nothing in my configuration, except that I added Test4Theory to my active project list again...

I will have a look at my computer and the CPDN tasks. I hope that the error will not persist.
ID: 43681
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 43682 - Posted: 18 Jan 2012, 23:57:43 UTC

It looks like they are all failing with the same error, only EU tasks, and it started in November.

My guess would be that one of the files the hadam3p EU needs has become corrupted or gone missing. Perhaps set climateprediction.net to no new work, then do a project reset, then allow work again. That should clear out the files in the climateprediction.net directory and allow a fresh batch of files to download.
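
The three steps above could also be run from the command line. What follows is only a minimal sketch, assuming `boinccmd` (the command-line tool shipped with BOINC) is on the PATH and that the project's master URL is `http://climateprediction.net/`. That URL is an assumption; check the one your client reports with `boinccmd --get_project_status`. The helper names `reset_commands` and `run_reset` are made up for illustration. Depending on your setup, `boinccmd` may also need `--passwd` or to be started from the BOINC data directory. Note that a project reset discards any work in progress for that project.

```python
import subprocess

# Assumption: the master URL your BOINC client has on record for CPDN.
# Verify it with `boinccmd --get_project_status` before running anything.
PROJECT_URL = "http://climateprediction.net/"


def reset_commands(project_url, boinccmd="boinccmd"):
    """Return the three boinccmd invocations in the order described above:
    no new work, project reset, allow new work."""
    return [
        [boinccmd, "--project", project_url, op]
        for op in ("nomorework", "reset", "allowmorework")
    ]


def run_reset(project_url, dry_run=True):
    """Print the commands (dry_run=True) or actually execute them."""
    for cmd in reset_commands(project_url):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_reset(PROJECT_URL)
```

Running it as written only prints the three commands; pass `dry_run=False` to `run_reset` once you are happy with them.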
ID: 43682
[boinc.at] Nowi

Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 43683 - Posted: 19 Jan 2012, 16:26:25 UTC

Thanks Geophi!

The latest task is running fine so far (about 44% completed). I will watch it and, if an error occurs, I will try your procedure.
ID: 43683
[boinc.at] Nowi

Joined: 16 Jul 05
Posts: 32
Credit: 10,513,155
RAC: 0
Message 43684 - Posted: 20 Jan 2012, 7:50:38 UTC

I've got another one. All three exited with an error: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=7807866
ID: 43684
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43685 - Posted: 20 Jan 2012, 8:48:17 UTC - in response to Message 43684.  

Thanks.
I'll report the serial crasher.


Backups: Here
ID: 43685
skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 43693 - Posted: 22 Jan 2012, 2:44:37 UTC - in response to Message 43685.  
Last modified: 22 Jan 2012, 3:21:45 UTC

I had similar problems on 2 systems (Windows Server 2008 and Windows Server 2003).
On the 2003 server there was a pop-up error message:
    Microsoft Visual C++ Runtime Library,
    Runtime Error!
    Program:E:\BOINC\projects...

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    [OK]


When I closed the message a Climate task failed!
Another identical message popped up. This time I closed BOINC instead. I left it a few minutes and started BOINC again, but the same message appeared immediately. Hadcm3n_u3ff_1980_40_007683507_0 was sitting at 100% complete.
So I closed BOINC, then the message, and I am going to restart to try to prevent these runaway errors.

On the 2008 system, when I closed and reopened BOINC I did not get any more errors, at least not yet. The error there was different: it didn't contain the same message, and didn't say much other than "Error!"

Could not see anything in the logs.

After the system restart I get the same MS VC++ pop-up error.
BOINC is running a Hadcm3n_u3ff_1980_40_007683509_0 task which is at 100%. Elapsed time is 45h, so there would be another 265h or so to completion (going by the other CMFRO 6.07 tasks).
I would prefer not to lose that sort of time if possible.
Any suggestions?


ID: 43693
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43694 - Posted: 22 Jan 2012, 5:04:08 UTC - in response to Message 43693.  

If BOINC is displaying 100%, then it's lost contact with the model, which usually means that the model has crashed.
The only way to get it going again is to restore a backup made before the crash.
It's possible that the temperature graph will be all blue, and the Hours Elapsed and the Timestep will not be advancing, or at best, only for a while before starting in a constant loop.

The runtime error is something that happens to some computers, sometimes. There's no known cure.


Backups: Here
ID: 43694
skgiven
Joined: 5 Jun 06
Posts: 28
Credit: 2,790,048
RAC: 0
Message 43695 - Posted: 22 Jan 2012, 11:34:43 UTC - in response to Message 43694.  
Last modified: 22 Jan 2012, 12:16:25 UTC

Thanks Les.
I aborted the task sitting at 100%. The error message disappeared.
Closed and opened Boinc again and the error message did not return.
Might be worth knowing.
ID: 43695

