Reporting - Errors while computing
Author | Message |
---|---|
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
Hello Les, I wondered what was going on. I checked to see how that work unit was processed by other folks, and I was the only person that had it, so the download attempt was just a mistake? Frank |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Yes, I'm afraid you could call it that. Luckily it does no harm. Cpdn news |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Hi everyone, yesterday I upgraded to BOINC 7.0.60, so I just wanted to ask all you long-time seasoned pros whether everything looks OK.
Windows 7. CPU type: GenuineIntel Intel(R) Xeon(R) CPU E5507 @ 2.27GHz [Family 6 Model 26 Stepping 5]. Number of processors: 8 (8 physical CPUs, no Hyper-Threading).
I'm only running one project on this computer, Climate Prediction.net, 24/7, with 8 models now crunching. Reporting one error while computing on the following:
name: hadcm3n_39aw_1980_40_008283532
application: UK Met Office Coupled Model Full Resolution Ocean
created: 14 Jan 2013 6:26:17 UTC
15/04/2013 9:00:29 AM | climateprediction.net | Started download of hadcm3n_39aw_1980_40_008283532.zip
15/04/2013 9:00:30 AM | climateprediction.net | Finished download of hadcm3n_39aw_1980_40_008283532.zip
15/04/2013 9:00:30 AM | climateprediction.net | Started download of ocean_39aw_1980_40_008283532_0.gz
15/04/2013 9:00:39 AM | climateprediction.net | Started download of atmos_39aw_1980_40_008283532_0.gz
15/04/2013 9:01:15 AM | climateprediction.net | Finished download of atmos_39aw_1980_40_008283532_0.gz
15/04/2013 9:01:36 AM | climateprediction.net | Finished download of ocean_39aw_1980_40_008283532_0.gz
15/04/2013 9:02:07 AM | climateprediction.net | Computation for task hadcm3n_39aw_1980_40_008283532_3 finished
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_1.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_2.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_3.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_4.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
<core_client_version>7.0.60</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Sorry, too many model crashes! :-( Called boinc_finish </stderr_txt> ]]>
I hope this helps, Byron |
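For anyone comparing several failed tasks, the stderr text in a report like the one above can be tallied rather than read by eye. The following is only a minimal sketch: it assumes the stderr has already been copied into a string, and the function name `count_crash_reasons` is made up for this illustration (it is not part of BOINC or CPDN). It counts how often each routine name follows "Model crashed:".

```python
import re
from collections import Counter


def count_crash_reasons(stderr_text: str) -> Counter:
    """Count 'Model crashed: <ROUTINE>:' occurrences by routine name.

    Hypothetical helper for illustration; the stderr layout is assumed
    to match the excerpt quoted in this thread.
    """
    # Capture the word that follows "Model crashed:" (e.g. INITTIME).
    routines = re.findall(r"Model crashed:\s*(\w+):", stderr_text)
    return Counter(routines)


if __name__ == "__main__":
    sample = (
        "Model crashed: INITTIME: Atmosphere basis time mismatch "
        "tmp/pipe_dummy 2048 "
    ) * 6 + "Sorry, too many model crashes! :-( Called boinc_finish"
    for routine, count in count_crash_reasons(sample).items():
        print(f"{routine}: {count} crash(es)")
```

If every task in a workunit shows the same single routine name, that points to a fault in the model itself and not in one particular PC.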
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
There was something wrong with some of the models created at that time. If you look at the stderr of all the models in the workunit you'll find they all crashed with the INITTIME error. It was a defect in the model that prevented it from starting. Cpdn news |
Send message Joined: 2 Mar 06 Posts: 27 Credit: 240,040 RAC: 0 |
Hi, I'm a committed participant, usually running 6 CPDN tasks simultaneously, but I rarely check my account. I just noticed that of the 16 tasks that ended in the last eight months, only one completed. All the others stopped with "Error While Computing" or "Error While Downloading". What gives? Should I bail out of CPDN? I hate to think that I am accomplishing nothing with my CPU cycles because of these errors. Anything I can do? Thanks, Steve |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Steve, Nearly all of your failures have been on the longer hadcm3n models. They failed at multiples of 25%, which is when the decadal uploads take place. This is a common problem on some PCs, and it is unclear why some PCs frequently have it and others don't (or seldom do). I would continue running your current ones, but in the climateprediction.net specific preferences on your account page, select other model types and not hadcm3n. Recently, availability of tasks has been inconsistent, so you might not get any new ones for a while. |
Send message Joined: 6 Jun 11 Posts: 11 Credit: 356,113 RAC: 0 |
Hi, I finally seem to be past the download errors of the past weeks, but except for one WU that's still crunching, all the others in the last 2 days have failed. This is one heck of a lot of time crunching for nothing. Any idea what's causing all these errors? http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728575 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728505 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728222 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15727928 and this one that had to be aborted: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728098 Could someone tell me if this final task seems to be running correctly? I have my doubts. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15727960 Cheers, thanks for looking. |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
Hi JugNut, Your running one trickled just over an hour ago so seems to be OK. Fingers crossed. I had the same error overnight. Out Of Memory (C++ Exception). Loads of memory here so shouldn't be able to run out. Checking your wingmen shows the same error on a couple of those failed WUs and the others have yet to show but as your ones all have the same error, I suspect they will fail as well. Someone will be along soon to tell us what's happening. [Edit] Just got a resend of another one of these so I'll try to pay attention to what happens to it. The other new resend I got didn't start for the wingman but has got past a few checkpoints so could be ok, unless it's one of the same batch. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Great! Another problem. I'll let "them" know. PS Jugnut It would be a good idea to upgrade to 7.0.60, which contains a fix for our download problems. Just in case. |
Send message Joined: 6 Jun 11 Posts: 11 Credit: 356,113 RAC: 0 |
Thanks for your time, guys. @ Ray Murray: I also have 16GB on each of those machines reporting errors, so "out of memory" is probably not the problem for me either. I'm glad at least 1 WU seems to be still alive & kicking. Oh well, try, try again. Please keep us informed of any progress. @ Les Bayliss: I have just upgraded to 7.0.62. Thanks. All the best, JN |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
After some more digging: my failed WU didn't tidy up after itself, so there is its 470MB folder (not modified since yesterday, so probably nothing interesting in there) and the stderr log still in the slot, but also an xml file showing a last update at timestep 25921, exactly where it would make the first trickle. Significant? From BOINC logs: [task] Process for hadcm3n_4f6k_1980_40_008350244_0 exited, exit code 3765269347, task state 1 before the [task] exit code -529697949 (0xe06d7363) and output files absent. PS Another resend with the same error, which I'm just going to abort so I only have to concentrate on the 1 definitely dodgy and 1 possibly OK. Final edit before editing times out: Is there another flag I could set other than [task_debug] that would give more detailed info on the error? |
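The two exit codes quoted from the BOINC log in the post above, 3765269347 and -529697949, are the same 32-bit value (0xe06d7363) read once as an unsigned and once as a signed integer. A small sketch of the arithmetic, with helper names chosen just for this example:

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the signed reading is value - 2**32.
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value: int) -> int:
    """Interpret value as an unsigned 32-bit integer."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    unsigned_code = 3765269347
    signed_code = -529697949
    print(hex(unsigned_code))              # 0xe06d7363
    print(to_signed32(unsigned_code))      # -529697949
    print(hex(to_unsigned32(signed_code))) # 0xe06d7363
```

So the log is reporting one failure, not two different ones.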
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'm thinking "Out of memory C++" etc, is a compiler problem. We'll see. If the project people haven't screamed loudly and run off into the distance. |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
Both the resends from my message yesterday, that have been under observation, failed at the first trickle point (as expected) and again didn't tidy up. Those that have had a few of these will be building up a large amount of garbage with each failed wu leaving its c.470MB folder behind. Is there a server side cleanup option (maybe a forced reset?) or will people have to throw out this junk manually? Maybe a global message through Boinc? Speaking of junk; could you delete my double post from yesterday, please Les, just to tidy up the thread. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,826,970 RAC: 5,066 |
"Is there a server side cleanup option (maybe a forced reset?) or will people have to throw out this junk manually?" Not to my knowledge. Performing a project reset from within BOINC Manager will usually clear out the debris. If not, then a detach (or remove) will get rid of everything. After re-attaching, a host merge is usually necessary. Both options should only be attempted when no models are live, since they will also be removed. |
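Before resorting to a reset, it may help to see how much space leftover folders are actually taking. The sketch below is an assumption-laden illustration, not a CPDN tool: the function names are invented here, the directory to scan is whatever you pass in (the location of the BOINC data directory differs between installs and is not specified in this thread), and nothing is deleted. It only lists subfolders at or above a size threshold, largest first, so they can be reviewed by hand.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def large_subdirs(parent: str, min_bytes: int) -> list:
    """Return (name, size) for each direct subfolder of parent >= min_bytes."""
    results = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size_bytes(entry.path)
            if size >= min_bytes:
                results.append((entry.name, size))
    # Largest folders first, so the likely leftovers are at the top.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical usage: scan the current directory for folders over 400 MB.
    for name, size in large_subdirs(".", 400 * 1024 * 1024):
        print(f"{name}: {size / (1024 * 1024):.0f} MB")
```

As noted in the reply above, only remove anything when you are sure it does not belong to a live model.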
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
Thanks Iain, I thought that was the case. I've just deleted the leftover files from the dead WUs as some PNWs have sneaked in while I wasn't looking. I was thinking more about those who don't visit the boards or even don't realise they have had a problem until they find they have used up more disk space than they were expecting. |
Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,522,141 RAC: 1,164 |
Out of Memory Errors - My experience is that on Windows XP computers the C++ out of memory error is trapped by Windows (with an error window), and the task ends up with a Computational Error when I click on OK. On Linux the real memory gets sucked up, then the swap file, and then the whole computer just sort of hangs. "Sort of" means it takes 30 seconds to recognize a mouse click. I have aborted these tasks when I realize what is going on. I don't think it matters how much memory you have. It will all get used up. There appear to be tasks in the system still being sent out with this problem. All of the tasks that I have aborted (Linux) or that had a Computational Error (Windows) have had similar problems for my wingmen. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
We're still trying to figure this out. Not all of the tasks are failing, and of those that have, not all have had this new error. The "out of memory" error most likely refers to stack space, not the total memory in any computer. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I have had two Out of Memory errors today and noticed that the address given for the errors is exactly the same for the two, viz. Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75EDC41F. They had been running in parallel but failed about an hour apart. I am using BOINC 7.0.62. |
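On the code itself: 0xe06d7363 is the exception code the Microsoft Visual C++ runtime uses whenever it raises a C++ exception, so seeing it on every one of these failures is expected and does not by itself say which exception was thrown. Its low three bytes are the ASCII letters "msc". A short sketch that splits the code apart (the function name is invented for this example):

```python
def decode_msvc_exception(code: int) -> tuple:
    """Split a 32-bit exception code into its top byte and a 3-letter tag.

    Illustrative helper only: the low three bytes are decoded as ASCII.
    """
    code &= 0xFFFFFFFF
    prefix = code >> 24
    tag = (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii", errors="replace")
    return prefix, tag


if __name__ == "__main__":
    prefix, tag = decode_msvc_exception(0xE06D7363)
    print(hex(prefix), tag)  # 0xe0 msc
```

This says nothing about why the address was identical in the two reports; that part remains an open question in the thread.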
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
"We're still trying to figure this out." Here is another one of those: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F. hadcm3n_4kr6_1980_40_008353096_3, model crashed approx. one hour ago. |