climateprediction.net (CPDN) home page
Thread 'HadCM3 short - errors galore'

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50838 - Posted: 16 Nov 2014, 20:18:09 UTC - in response to Message 50837.  

Was that me?

Yes Richard, it was indeed you - I'm obviously having an off day or two.

Thanks for the detailed reply and link, I'm quite happy to stick with the older version. If I move to Win 10 in a few years that may be different, but by then hopefully CPDN will have caught up. Although I see what you mean about "foreseeable". I guess I won't be holding my breath.

When it is safe for Win service users to update BOINC, perhaps you'd be so good as to post a message in the Windows section of the BB.
ID: 50838

Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,038,916
RAC: 14,611
Message 50877 - Posted: 21 Nov 2014, 20:36:38 UTC - in response to Message 50838.  

Just had 2 short runs error with the following:

hadcm3s_4y4c_2007_2_009197478_0 Workunit 9323100
hadcm3s_4v8b_2007_2_009193733_0 Workunit 9319355
Exit status -529697949 (0xffffffffe06d7363)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x754BC42D

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.2.4


Dump Timestamp : 11/21/14 19:09:04
Install Directory : C:\Program Files (x86)\BOINC\
Data Directory : C:\ProgramData\BOINC
Project Symstore :
LoadLibraryA( C:\Program Files (x86)\BOINC\\dbghelp.dll ): GetLastError = 126
LoadLibraryA( dbghelp.dll ): GetLastError = 8
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 0, Write: 0, Other 0

- I/O Transfers Counters -
Read: 0, Write: 0, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 0, QuotaPeakPagedPoolUsage: 0
QuotaNonPagedPoolUsage: 0, QuotaPeakNonPagedPoolUsage: 0

- Virtual Memory Usage -
VirtualSize: 0, PeakVirtualSize: 0

- Pagefile Usage -
PagefileUsage: 0, PeakPagefileUsage: 0

- Working Set Size -
WorkingSetSize: 0, PeakWorkingSetSize: 0, PageFaultCount: 0

*** Dump of thread ID 6280 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x754BC42D



*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0
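
A note on the two numbers in the "Exit status" line: the negative decimal value and the hexadecimal exception code are the same 32-bit pattern, read as signed and as unsigned. The sketch below shows the conversion in Python; the function name is made up for illustration and is not part of BOINC. 0xE06D7363 is the code the Microsoft Visual C++ runtime uses when a C++ exception is thrown, and its low three bytes are the ASCII letters "msc".

```python
def exit_status_to_exception_code(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit code."""
    return status & 0xFFFFFFFF


code = exit_status_to_exception_code(-529697949)
print(hex(code))  # 0xe06d7363

# The low three bytes of the code are plain ASCII text.
print((code & 0xFFFFFF).to_bytes(3, "big").decode("ascii"))  # msc
```

So the exit status only says that an uncaught C++ exception ended the task; the "Out Of Memory" wording comes from the reason line in the dump.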

ID: 50877

Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 50878 - Posted: 21 Nov 2014, 21:38:09 UTC - in response to Message 50877.  

Just had 2 short runs error with the following:

hadcm3s_4y4c_2007_2_009197478_0 Workunit 9323100
hadcm3s_4v8b_2007_2_009193733_0 Workunit 9319355
Exit status -529697949 (0xffffffffe06d7363)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x754BC42D

I had three like that, and have given up HadCM3 short on this machine (i7-4771 and Win7 64-bit; BOINC 7.4.26, non-service).

Workunits:
9325625
9326394
9326302

Though a HadAM3P-HadRM3P Pacific North West has been running nicely for over a day, with another day to go.


ID: 50878

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50879 - Posted: 21 Nov 2014, 22:39:12 UTC

The Out Of Memory error is well known for these, and is about "stack memory".

Like several problems with several model types, it's being investigated.


ID: 50879

astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 50881 - Posted: 24 Nov 2014, 19:46:18 UTC

To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 50881

Byron Leigh Hatch @ team Carl ...

Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 50882 - Posted: 25 Nov 2014, 6:15:47 UTC - in response to Message 50881.  
Last modified: 25 Nov 2014, 6:22:59 UTC

To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks.

Hello everyone . . . and thank you . . . astroWX, Les, geophi, Iain Inglis, Andy, Hannah Rowlands . . . and staff, Volunteer Forum Moderators, scientists and Volunteer Crunchers.

From 'News and Announcements', Re. HadCM3s 'short' tasks for year 2007:
<quote>

From staff:
We have cancelled the outstanding hadcm3s 2007 workunits and contacted the scientist involved
who is going to investigate the issue further
. . . Andy

Like the set Les noted above,
this is another deprecated task-set and any in your Task list
should be aborted. Same with any 'retreads' which might fall to your machine(s).
We'll surely see this batch again after the scientist fixes the problem.

</quote>

It looks like I have only one HadCM3s 'short' task for the year 2007 in my BOINC task queue.

As I understand it, I should abort the following?

hadcm3s_50d3_2007_2_009200385

I'm not sure about the following HadCM3s 'short' tasks.
Is it OK to leave these in my BOINC task queue?

hadcm3s_4q3n_1985_2_009180193
hadcm3s_4q3n_1985_2_009180193
hadcm3s_4msg_1985_2_009175902
hadcm3s_3jzq_2005_2_009101956
hadcm3n_xbhr_1940_40_009151189
hadcm3s_4hdv_1997_2_009165933
hadcm3s_4cnu_1997_2_009159812
hadcm3s_49da_1997_2_009155544


Best Wishes and thank you,
Byron
ID: 50882

Bellator

Joined: 31 Mar 05
Posts: 44
Credit: 234,235
RAC: 0
Message 50884 - Posted: 25 Nov 2014, 7:19:43 UTC - in response to Message 50279.  

hadcm3s_4p1b_1985_2_009178813_0
Workunit 9306694
Created 12 Nov 2014 21:25:00 UTC
Sent 14 Nov 2014 6:30:59 UTC
Received 17 Nov 2014 13:31:55 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1288126
Report deadline 13 Feb 2015 13:58:10 UTC
Run time 132,053.72
CPU time 110,822.40
Validate state Initial
Claimed credit 0.00
Granted credit 0.00
application version UK Met Office HadCM3 short v7.24
Stderr show hide

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
est from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
[... identical line repeated 139 times in total ...]
Suspended CP
and so on and so forth. It is shown as completed (success), but claimed and granted credit are both 0. Is there a chance the credit will still be granted? Should I do anything about this?
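
When a stderr dump is mostly one line repeated, as above, it can help to collapse the runs before pasting it. A minimal sketch in Python, using only the standard library; the helper name and the "done" line in the sample are invented for illustration, not taken from any CPDN or BOINC tool:

```python
from itertools import groupby


def collapse_repeats(lines):
    """Yield (line, count) for each run of identical consecutive lines."""
    for line, run in groupby(lines):
        yield line, sum(1 for _ in run)


# Sample input: three copies of the log line from the post, plus a made-up last line.
sample = ["Suspended CPDN Monitor - Suspend request from BOINC..."] * 3 + ["done"]

for line, count in collapse_repeats(sample):
    print(f"{count:>4} x {line}")
```

Only consecutive duplicates are merged, so the order of events in the log is preserved.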
ID: 50884

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50885 - Posted: 25 Nov 2014, 7:35:20 UTC - in response to Message 50884.  

The credits situation is discussed in this.


ID: 50885

Byron Leigh Hatch @ team Carl ...

Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 50887 - Posted: 25 Nov 2014, 15:48:48 UTC - in response to Message 50881.  

To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks.

As I understand it, I should abort the following?

hadcm3s_50d3_2007_2_009200385

I'm not sure about the following HadCM3s 'short' tasks.
Is it OK to leave these in my BOINC task queue, or should I abort these tasks also?

hadcm3s_4q3n_1985_2_009180193
hadcm3s_4q3n_1985_2_009180193
hadcm3s_4msg_1985_2_009175902
hadcm3s_3jzq_2005_2_009101956
hadcm3n_xbhr_1940_40_009151189
hadcm3s_4hdv_1997_2_009165933
hadcm3s_4cnu_1997_2_009159812
hadcm3s_49da_1997_2_009155544

thanks
ID: 50887

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 50888 - Posted: 25 Nov 2014, 16:57:50 UTC - in response to Message 50887.  

To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks.

As I understand it, I should abort the following?

hadcm3s_50d3_2007_2_009200385

I'm not sure about the following HadCM3s 'short' tasks.
Is it OK to leave these in my BOINC task queue, or should I abort these tasks also?

hadcm3s_4q3n_1985_2_009180193
hadcm3s_4q3n_1985_2_009180193
hadcm3s_4msg_1985_2_009175902
hadcm3s_3jzq_2005_2_009101956
hadcm3n_xbhr_1940_40_009151189
hadcm3s_4hdv_1997_2_009165933
hadcm3s_4cnu_1997_2_009159812
hadcm3s_49da_1997_2_009155544

thanks


Byron, Yes, abort the 2007 task, the others should be fine I think.
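
For anyone with a longer queue, the same check can be sketched in a few lines of Python. This assumes, based only on the names quoted in this thread, that a task name is underscore-separated with the model type first and the model year third; the function name is hypothetical. It only picks out names, and the aborting itself is still done by hand in BOINC.

```python
def should_abort(task_name: str) -> bool:
    """Return True for hadcm3s tasks whose year field is 2007.

    Assumes names look like <model>_<id>_<year>_..., as in the lists above.
    """
    parts = task_name.split("_")
    return len(parts) >= 3 and parts[0] == "hadcm3s" and parts[2] == "2007"


queue = [
    "hadcm3s_50d3_2007_2_009200385",
    "hadcm3s_4q3n_1985_2_009180193",
    "hadcm3n_xbhr_1940_40_009151189",
]

to_abort = [name for name in queue if should_abort(name)]
print(to_abort)  # ['hadcm3s_50d3_2007_2_009200385']
```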
ID: 50888

Byron Leigh Hatch @ team Carl ...

Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 50889 - Posted: 25 Nov 2014, 17:17:24 UTC - in response to Message 50888.  

Byron, Yes, abort the 2007 task, the others should be fine I think.

Hello geophi, with thanks Byron
ID: 50889

[AF>Le_Pommier] Jerome_C2005

Joined: 21 Oct 10
Posts: 53
Credit: 2,101,753
RAC: 3,985
Message 50907 - Posted: 1 Dec 2014, 8:56:29 UTC

Hello people

I again have another short WU that seems to be dated 2007.

Do I have to abort it too? (It started crunching 45 min ago on my PC.)

Why is the server not able to cancel these, or the project team not able to flag them as "bad"?

How long will we keep getting such bad WUs? (If they still are bad.)

Thanks
ID: 50907

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50908 - Posted: 1 Dec 2014, 10:03:37 UTC - in response to Message 50907.  

Do I have to abort it too? (It started crunching 45 min ago on my PC.)


Yes, best to abort it.

Why is the server not able to cancel these, or the project team not able to flag them as "bad"?


Most were withdrawn. As I understand it, the problem is that those snapped up before the programmers knew about the problems would all have to be withdrawn individually, because writing a script to catch all the re-issues isn't a quick or trivial job. So they will keep getting reissued until they reach their limit. I forget how many this is.

Interestingly, your wingman on this workunit (computer 351510) seems to be crashing everything in sight, with an error that at first sight looks like a missing 32-bit library in Linux. It is obviously not the usual problem, though, as some tasks produce three or more trickle-ups before crashing.
ID: 50908

Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,038,916
RAC: 14,611
Message 50909 - Posted: 1 Dec 2014, 12:07:14 UTC - in response to Message 50908.  

Short and full-res models are limited to 5 issues; PNW and ANZ (and probably EU) to 3.
ID: 50909

[AF>Le_Pommier] Jerome_C2005

Joined: 21 Oct 10
Posts: 53
Credit: 2,101,753
RAC: 3,985
Message 50919 - Posted: 2 Dec 2014, 11:00:47 UTC

Mmmm I only saw your answer this morning, and guess what, the WU had crashed already after 15 hours of calculation...


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7645C41F


So I didn't have to cancel it, after all :)
ID: 50919

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 51180 - Posted: 11 Jan 2015, 21:25:38 UTC

I've just finished what I assume was one of the new batch of short tasks here (hadcm3s 1980, created 22 Dec).

It crashed with the not uncommon error:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7731C42D

I've 3 more running, but I've now turned them off for my PC. I should know within a couple of days if these running are OK, but Iain's comment here does come to mind.

I've just noticed that another of the 22 Dec batch snuck through, and looking at the workunit here, all 5 tasks failed: 1 similar to the above, 1 with the Linux library issue, and 3 with Invalid Theta. Given the frequency of the Invalid Thetas that have been occurring in these models, you really have to wonder if it is a programming issue rather than a model instability.
ID: 51180

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 51184 - Posted: 12 Jan 2015, 20:26:33 UTC - in response to Message 51180.  

OK, all 4 HadCM3 short tasks have now failed: 2 with an out of memory error and 2 with Invalid Theta; see the tasks sent 4 & 9 Jan here.

Both of the Invalid Theta tasks managed to get past the first timestep and failed with Output File Absent, e.g. Output file hadcm3s_75qk_1980_2_009366458_0_2.zip for task hadcm3s_75qk_1980_2_009366458_0 absent

I am really struggling to believe that the HadCM3 short tasks are worth running, as the error rate is way too high. I can understand that models can go out of bounds with Invalid Theta, but given the huge number of these that happen, then surely the model is lacking. As for system errors, I agree with Iain - these should never have gotten out of Beta testing (same for modelling actually). Volunteers, the vast majority of whom probably never read these forums, should not have to continually interact with their PC to keep CPDN models running.

There are a huge number of volunteers out there, all creating additional CO2, and costs, by running these models in the hope that there is a slight chance it will make a difference. I suggest that if the researchers had to run these models on their own supercomputer, the problems would soon be sorted out.

BTW, the other models (ANZ, Africa, Europe, Full Res Ocean) seem to chug along just fine.
ID: 51184

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51185 - Posted: 12 Jan 2015, 20:32:43 UTC - in response to Message 51184.  

The hadcm3ps models are researching areas around the edge of stability of the parameters, and high failure rates are expected.

ID: 51185

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 51188 - Posted: 13 Jan 2015, 4:56:44 UTC - in response to Message 51185.  

Hi Les, if my memory serves me correctly, failure rates of up to 30% were mentioned at some time as what to expect.

So now to foot in mouth time.

In my case, of the HadCM3 short models issued since 10 Oct, I have had zero complete and around 180 error out for whatever reason (this excludes all those we aborted). Before this, these models were completing OKish. Coincidentally, this is about the time I rolled BOINC back to 7.0.36, as stated in an earlier post.

Out of interest I went to a page of my results from November that were all short models with errors. All 20 of mine failed with Invalid Theta, with a very short CPU time. When I checked the workunits, 17 of them went to completion on other boxes (mainly Windows); the other 3 did not complete at all. So the model seems to be OK; I guess I should have done more digging earlier!

I can only deduce that there was some change in the model/BOINC structure that makes my PC pretty crap with these models, possibly the running as a service issue.

I also looked at a couple of the PCs that completed a few of my failures; they had run loads of HadCM3 short tasks with very few failures.
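
A rough back-of-envelope check of why this points at the machine rather than the model, sketched in Python. It assumes each task fails independently with the 30% figure recalled in this post; that is an illustrative assumption, not a project statistic, and the count of 180 is approximate.

```python
import math

failure_rate = 0.30    # upper figure recalled in the post; assumed per-task and independent
failed_in_a_row = 180  # approximate count from the post

# Probability of that many failures in a row under the assumption above.
p = failure_rate ** failed_in_a_row
print(f"log10(p) = {math.log10(p):.1f}")  # about -94.1

# Share of the 20 checked workunits that completed on another machine.
print(f"{17 / 20:.0%}")  # 85%
```

A probability with roughly 94 zeros after the decimal point is not bad luck, and 17 of 20 workunits finishing elsewhere fits a host-specific cause.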
ID: 51188

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51191 - Posted: 13 Jan 2015, 7:34:00 UTC - in response to Message 51188.  

Hi Martin

Luck of the draw, I think.

You could go for one of the regional models, which are mostly "attribution studies" at the moment.
They should be more stable.


ID: 51191