Message boards : Number crunching : HadCM3 short - errors galore
Author | Message |
---|---|
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Was that me? Yes Richard, it was indeed you - I'm obviously having an off day or two. Thanks for the detailed reply and link, I'm quite happy to stick with the older version. If I move to Win 10 in a few years that may be different, but by then hopefully CPDN will have caught up. Although I see what you mean about "foreseeable". I guess I won't be holding my breath. When it is safe for Win service users to update BOINC, perhaps you'd be so good as to post a message in the Windows section of the BB. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,037,259 RAC: 14,576 |
Just had 2 short runs error with the following: hadcm3s_4y4c_2007_2_009197478_0 Workunit 9323100 hadcm3s_4v8b_2007_2_009193733_0 Workunit 9319355 Exit status -529697949 (0xffffffffe06d7363) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x754BC42D Engaging BOINC Windows Runtime Debugger... ******************** BOINC Windows Runtime Debugger Version 7.2.4 Dump Timestamp : 11/21/14 19:09:04 Install Directory : C:\Program Files (x86)\BOINC\ Data Directory : C:\ProgramData\BOINC Project Symstore : LoadLibraryA( C:\Program Files (x86)\BOINC\\dbghelp.dll ): GetLastError = 126 LoadLibraryA( dbghelp.dll ): GetLastError = 8 *** Dump of the Process Statistics: *** - I/O Operations Counters - Read: 0, Write: 0, Other 0 - I/O Transfers Counters - Read: 0, Write: 0, Other 0 - Paged Pool Usage - QuotaPagedPoolUsage: 0, QuotaPeakPagedPoolUsage: 0 QuotaNonPagedPoolUsage: 0, QuotaPeakNonPagedPoolUsage: 0 - Virtual Memory Usage - VirtualSize: 0, PeakVirtualSize: 0 - Pagefile Usage - PagefileUsage: 0, PeakPagefileUsage: 0 - Working Set Size - WorkingSetSize: 0, PeakWorkingSetSize: 0, PageFaultCount: 0 *** Dump of thread ID 6280 (state: Initialized): *** - Information - Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000 - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x754BC42D *** Debug Message Dump **** *** Foreground Window Data *** Window Name : Window Class : Window Process ID: 0 Window Thread ID : 0 |
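For anyone puzzling over the two forms of the exit status in the dump above, here is a minimal Python sketch of how they relate. It only reinterprets the numbers already posted; the function names are made up for illustration. The value 0xE06D7363 is the generic code the Microsoft Visual C++ runtime uses for a thrown C++ exception (its low three bytes spell "msc"), and 126 and 8 are the standard Win32 codes ERROR_MOD_NOT_FOUND and ERROR_NOT_ENOUGH_MEMORY. None of this says why the model threw the exception.

```python
# Sketch only: helper names are hypothetical, the numbers come from the post above.

# Standard Win32 error codes seen in the two LoadLibraryA lines of the dump.
WIN32_ERRORS = {
    8: "ERROR_NOT_ENOUGH_MEMORY",
    126: "ERROR_MOD_NOT_FOUND",
}


def exit_status_to_hex(status: int) -> str:
    """Show a signed 32-bit exit status as the unsigned hex value."""
    return f"0x{status & 0xFFFFFFFF:08x}"


def msvc_exception_tag(status: int) -> str:
    """Decode the low three bytes of the code as ASCII text."""
    return (status & 0xFFFFFF).to_bytes(3, "big").decode("ascii")


status = -529697949
print(exit_status_to_hex(status))   # 0xe06d7363
print(msvc_exception_tag(status))   # msc
for code in (126, 8):
    print(code, WIN32_ERRORS[code])
```

So the negative number and the 0xe06d7363 in the report are the same value, shown signed and unsigned.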
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Just had 2 short runs error with the following: I had three like that, and have given up HadCM3 short on this machine (i7-4771 and Win7 64-bit; BOINC 7.4.26, non-service). Workunits: 9325625 9326394 9326302 Though a HadAM3P-HadRM3P Pacific North West has been running nicely for over a day, with another day to go. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The Out Of Memory error is well known for these, and is about "stack memory". Like several problems with several model types, it's being investigated. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks. Hello everyone, and thank you astroWX, Les, geophi, Iain Inglis, Andy, Hannah Rowlands, staff, Volunteer Forum Moderators, scientists and volunteer crunchers. Re. the 'News and Announcements' post concerning HadCM3s 'short' tasks for year 2007: it looks like I have only one HadCM3s 'short' task for the year 2007 in my BOINC task queue. As I understand it, I should abort the following? hadcm3s_50d3_2007_2_009200385 I'm not sure about the following HadCM3s 'short' tasks. Is it OK to leave these in my BOINC task queue? hadcm3s_4q3n_1985_2_009180193 hadcm3s_4q3n_1985_2_009180193 hadcm3s_4msg_1985_2_009175902 hadcm3s_3jzq_2005_2_009101956 hadcm3n_xbhr_1940_40_009151189 hadcm3s_4hdv_1997_2_009165933 hadcm3s_4cnu_1997_2_009159812 hadcm3s_49da_1997_2_009155544 Best wishes and thank you, Byron |
Send message Joined: 31 Mar 05 Posts: 44 Credit: 234,235 RAC: 0 |
hadcm3s_4p1b_1985_2_009178813_0 Workunit 9306694 Created 12 Nov 2014 21:25:00 UTC Sent 14 Nov 2014 6:30:59 UTC Received 17 Nov 2014 13:31:55 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 1288126 Report deadline 13 Feb 2015 13:58:10 UTC Run time 132,053.72 CPU time 110,822.40 Validate state Initial Claimed credit 0.00 Granted credit 0.00 application version UK Met Office HadCM3 short v7.24 Stderr <core_client_version>7.2.42</core_client_version> <![CDATA[ <stderr_txt> est from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CP and so on and so forth. It is shown as completed (success), but the claimed and granted credit are both 0. Is there a chance the credit will still be granted? Should I do anything about this? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks. As I understand it, I should abort the following? hadcm3s_50d3_2007_2_009200385 I'm not sure about the following HadCM3s 'short' tasks. Is it OK to leave these in my BOINC task queue, or should I abort these tasks also? hadcm3s_4q3n_1985_2_009180193 hadcm3s_4q3n_1985_2_009180193 hadcm3s_4msg_1985_2_009175902 hadcm3s_3jzq_2005_2_009101956 hadcm3n_xbhr_1940_40_009151189 hadcm3s_4hdv_1997_2_009165933 hadcm3s_4cnu_1997_2_009159812 hadcm3s_49da_1997_2_009155544 Thanks |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
To avoid some wasted effort: Please see the post in 'News and Announcements' concerning HadCM3s year 2007 tasks. Byron, Yes, abort the 2007 task, the others should be fine I think. |
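For anyone with a longer queue, here is a minimal Python sketch of picking out the affected tasks by name. It assumes, based only on the names quoted in this thread, that the third underscore-separated field of a task name is the model year; that layout is a guess rather than a documented CPDN format, and the helper name is hypothetical. It only builds a list and does not touch BOINC, so the aborting itself still has to be done by hand in the BOINC Manager.

```python
# Sketch only: the field layout (model_runid_year_...) is an assumption
# inferred from the task names quoted in this thread.

def task_year(task_name: str) -> int:
    """Return the third underscore-separated field as the model year."""
    return int(task_name.split("_")[2])


# Task names copied from Byron's post.
queue = [
    "hadcm3s_50d3_2007_2_009200385",
    "hadcm3s_4q3n_1985_2_009180193",
    "hadcm3s_4msg_1985_2_009175902",
    "hadcm3s_3jzq_2005_2_009101956",
    "hadcm3n_xbhr_1940_40_009151189",
    "hadcm3s_4hdv_1997_2_009165933",
    "hadcm3s_4cnu_1997_2_009159812",
    "hadcm3s_49da_1997_2_009155544",
]

# Only the HadCM3 short ("hadcm3s") tasks for model year 2007 are affected.
to_abort = [
    name for name in queue
    if name.startswith("hadcm3s_") and task_year(name) == 2007
]
print(to_abort)  # ['hadcm3s_50d3_2007_2_009200385']
```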
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Byron, Yes, abort the 2007 task, the others should be fine I think. Hello geophi, with thanks Byron |
Send message Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985 |
Hello people, I again have another short WU that seems to be dated 2007. Do I have to abort it too? (It started to crunch 45 min ago on my PC.) Why is the server not able to cancel these, or the project team not able to flag them as "bad"? How long will we keep getting such bad WUs (if they still are bad)? Thanks |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have to abort it too ? (it started to crunch 45mn ago on my PC) Yes, best to abort it. Why is the server not able to cancel this / the project team not able to flag them as "bad" ? Most were withdrawn. As I understand it, the problem is that those that got snapped up before the programmers knew about the problems would all have to be withdrawn individually, as writing a script to catch all the re-issues isn't a quick and trivial job, so they will keep getting reissued till they reach their limit. I forget how many this is. Interestingly, your wingman on this workunit (computer 351510) seems to be crashing everything in sight with an error that at first sight looks like a missing 32-bit library in Linux, but is obviously not the usual problem, as some tasks produce three or more trickle-ups before crashing. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,037,259 RAC: 14,576 |
Short and full res models are on 5 issues, PNW and ANZ - and probably EU - on 3. |
Send message Joined: 21 Oct 10 Posts: 53 Credit: 2,101,753 RAC: 3,985 |
Mmmm I only saw your answer this morning, and guess what, the WU had crashed already after 15 hours of calculation...
So I didn't have to cancel it, after all :) |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
I've just finished what I assume was one of the new batch of short tasks here (hadcm3s 1980, created 22 Dec). It crashed with the not uncommon error: Unhandled Exception Detected... I've 3 more running, but I've now turned them off for my PC. I should know within a couple of days if the ones still running are OK, but Iain's comment here does come to mind. I've just noticed that another of the 22 Dec batch snuck through, and looking at the workunit here, all 5 tasks failed: 1 similar to the above, 1 with the Linux library issue, and 3 with Invalid Theta. Given the frequency of the Invalid Thetas that have been occurring in these models, you really have to wonder if it is a programming issue rather than a model instability. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
OK, all 4 HadCM3 short tasks have now failed: 2 with the out of memory error, 2 with Invalid Theta (see tasks sent 4 & 9 Jan here). Both of the Invalid Theta tasks managed to get past the first timestep and failed with Output File Absent, e.g. Output file hadcm3s_75qk_1980_2_009366458_0_2.zip for task hadcm3s_75qk_1980_2_009366458_0 absent. I am really struggling to believe that the HadCM3 short tasks are worth running, as the error rate is way too high. I can understand that models can go out of bounds with Invalid Theta, but given the huge number of these that happen, surely the model is lacking. As for system errors, I agree with Iain: these should never have gotten out of Beta testing (same for modelling, actually). Volunteers, the vast majority of whom probably never read these forums, should not have to continually interact with their PC to keep CPDN models running. There are a huge number of volunteers out there, all creating additional CO2, and costs, by running these models in the hope that there is a slight chance it will make a difference. I suggest that if the researchers had to run these models on their own supercomputer, the problems would soon be sorted out. BTW, the other models (ANZ, Africa, Europe, Full Res Ocean) seem to chug along just fine. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The hadcm3ps models are researching areas around the edge of stability of the parameters, and high failure rates are expected. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Hi Les, if my memory serves me correctly, failure rates of up to 30% were mentioned at some time as what to expect. So now to foot in mouth time. In my case, of the HadCM3 short models issued since 10 Oct, I have had zero complete and around 180 error for whatever reason (this excludes all those we aborted). Before this, these models were completing OKish. Coincidentally, this is about the time I rolled BOINC back to 7.0.36, as stated in an earlier post. Out of interest I went to a page of my results that were all short models with errors in November. All 20 of mine failed with Invalid Theta, with a very short CPU time. When checking the workunits, 17 of those units went to completion on other boxes (mainly Windows); the other 3 did not complete at all. So the model seems to be OK, guess I should have done more digging earlier! I can only deduce that there was some change in the model/BOINC structure that makes my PC pretty crap with these models, possibly the running as a service issue. I also looked at a couple of the PCs that completed a few of my failures, and they had run loads of HadCM3 short tasks with actually very few failures. |
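As a rough back-of-envelope check on the figures above, here is a minimal Python sketch of how likely a run of 20 straight failures would be from chance alone. It assumes the half-remembered 30% failure rate and treats each task as failing independently; both are assumptions for illustration, not project figures, and the function name is made up.

```python
# Sketch only: the 30% rate is the recalled figure from the post above,
# and independence between tasks is assumed purely for illustration.

def prob_all_fail(p_fail: float, n_tasks: int) -> float:
    """Chance that n independent tasks all fail, each with probability p_fail."""
    return p_fail ** n_tasks


p = prob_all_fail(0.30, 20)
print(f"{p:.2e}")  # 3.49e-11
```

Under those assumptions, 20 failures in a row on one machine would be very unlikely to be bad luck alone, which fits the suspicion that something specific to this PC or its BOINC setup is involved.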
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Martin Luck of the draw, I think. You could go for one of the regional models, which are mostly "attribution studies" at the moment. They should be more stable. |