|
Questions and Answers : Windows : Errors in new HADAM3P_ANZ Tasks
Author | Message |
---|---|
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
Just started computing the new batch of HADAM3P_ANZ tasks, but both my machines have errors. My i7-3770K has ditched 3 tasks, all in just a few seconds, but is processing 4 tasks OK. My Xeon 6790 has trashed 4 and is running two... Plenty of HDD space and plenty of RAM. The i7 has 16 GB and the Xeon has 24, the max RAM possible. It happens so fast that I cannot see what goes wrong :( Is it only me?? ChrisD |
Send message Joined: 15 May 09 Posts: 4556 Credit: 19,039,635 RAC: 18,944 |
Clicking on the plus next to stderr on at least one of the failed tasks shows a replanca error, which is a problem with the task rather than with the computer it is running on. |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
Just got 4 more tasks: 1 trashed in 20 secs flat, another started OK, but I had to suspend processing for a while. 1/13/2016 12:58:48 PM | | Suspending computation - user request 1/13/2016 1:00:26 PM | | Resuming computation 1/13/2016 1:00:40 PM | climateprediction.net | Computation for task hadam3p_anz_h01n_201112_12_287_010252705_1 finished 1/13/2016 1:00:40 PM | climateprediction.net | Output file hadam3p_anz_h01n_201112_12_287_010252705_1_1.zip for task hadam3p_anz_h01n_201112_12_287_010252705_1 absent etc. This task had been processing for 16 min 33 secs... Tasks running now: 5, of which 4 have passed one hour of computing time and one of the new ones has reached 5 minutes. One task in queue, destiny unknown... I wish I could do some more debugging. Any suggestions? ChrisD (pulling his hair out) |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
"Clicking on the plus next to stderr on at least one of the failed tasks shows a replanca error which is a task problem rather than the computer it is running on." Thanks :) That is comforting to know. ChrisD |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904 |
Hi folks, I can confirm that 3 of the current HADAM3P_ANZ tasks crashed with a replanca error (and I'm the second user for whom these WUs failed); 2 others have been running for more than an hour with no errors. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have had five HADAM3P_ANZ tasks fail with "REPLANCA" errors on two machines running Win7 64-bit (BOINC 7.6.22), all in 11 or 12 seconds. And they failed on all other machines that have run them thus far (usually two others by now). But three other Australia/New Zealand tasks are still running on one of my machines after 3 to 6 hours, so they will probably do OK. Interestingly, all the REPLANCA errors are on tasks dated 1996 or earlier, while the ones still running are dated 2002 or 2003, if that is any help. And I have had one task (dated 2007) fail with just a "Model crash detected, will try to restart..." after 14 seconds, but it is still running on one other machine. |
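The pattern described above (REPLANCA failures clustering on the earlier model years) could be checked by tallying outcomes against the year embedded in each task name. The following is only a sketch in Python. It assumes that the `201112` field in a name such as `hadam3p_anz_h01n_201112_12_287_010252705_1` is a start date in YYYYMM form; that reading of the name, and the helper names `start_year` and `failures_by_year`, are assumptions made for illustration, not documented CPDN conventions.

```python
import re
from collections import Counter

# Assumed layout: <model>_<region>_<run id>_<YYYYMM>_... (not confirmed by CPDN)
TASK_RE = re.compile(
    r"^(?P<model>[a-z0-9]+)_(?P<region>[a-z]+)_(?P<run>[a-z0-9]+)_"
    r"(?P<year>\d{4})(?P<month>\d{2})_"
)


def start_year(task_name):
    """Return the assumed start year of a task, or None if the name does not fit."""
    match = TASK_RE.match(task_name)
    return int(match.group("year")) if match else None


def failures_by_year(results):
    """Count (year, outcome) pairs.

    results: iterable of (task_name, outcome) tuples, where outcome is any
    label you choose, e.g. "REPLANCA" or "running".
    """
    counts = Counter()
    for name, outcome in results:
        year = start_year(name)
        if year is not None:
            counts[(year, outcome)] += 1
    return counts
```

Feeding it the task names and outcomes copied from your own results pages would show at a glance whether the failures really do split by year.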
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"And I have had one task (dated 2007) fail with just a 'Model crash detected, will try to restart...' after 14 seconds, but it is still running on one other machine." It's possible that it's running on another machine, but it may just have been downloaded and be waiting to run, or it may not have started for some other reason. We won't know for sure until either a trickle is returned, or it fails on that other host as well. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"We won't know for sure until either a trickle is returned, or it fails on that other host as well." It seems to be just sitting in the cache; no trickles thus far. It is now the only one I really need to keep watch of (with apologies to Oxford English). http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19188042 |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
Re the failing task mentioned in msg. 53247: I suspended CPU using BOINC Manager, and when I re-enabled CPU the task died. Here is the stderr. Can anybody tell me what happened here?? <core_client_version>7.6.9</core_client_version> <![CDATA[ <stderr_txt> CPDN Monitor - Quit request from BOINC... Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=2592, selfPID=2592, iMonCtr=2 Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=1 Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2 Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6620, selfPID=6352, iMonCtr=1 Model crash detected, will try to restart... Leaving CPDN_Main::Monitor... Called boinc_finish </stderr_txt> This is not the first task that has been trashed when CPU was suspended, so there is a problem somewhere. If that could be solved, we might save quite a few resends. ChrisD |
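For anyone comparing several of these stderr dumps side by side, the monitor lines quoted above can be pulled apart mechanically. This is a rough Python sketch, not a CPDN tool: the regular expression is written only against the exact wording shown in this post, the function name `summarize_stderr` is made up for illustration, and no meaning is assigned to fields such as `bRetVal` or `iMonCtr`, since the thread does not explain them.

```python
import re

# Pattern based solely on the stderr lines quoted in this thread.
WORKER_RE = re.compile(
    r"(?P<role>Regional Worker|Global Worker|Controller):: "
    r"CPDN process is not running, exiting, "
    r"bRetVal = (?P<bRetVal>\d+), checkPID=(?P<checkPID>\d+), "
    r"selfPID=(?P<selfPID>\d+), iMonCtr=(?P<iMonCtr>\d+)"
)


def summarize_stderr(text):
    """Extract the worker exit lines and two marker messages from a stderr dump."""
    exits = []
    for match in WORKER_RE.finditer(text):
        row = {"role": match.group("role")}
        for key in ("bRetVal", "checkPID", "selfPID", "iMonCtr"):
            row[key] = int(match.group(key))
        exits.append(row)
    return {
        "quit_requested": "Quit request from BOINC" in text,
        "crash_detected": "Model crash detected" in text,
        "exits": exits,
    }
```

Running it over the stderr of each failed task would make it easier to see whether every suspend-related failure shows the same sequence of exits.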
Send message Joined: 15 May 09 Posts: 4556 Credit: 19,039,635 RAC: 18,944 |
The ANZ tasks have now been fixed and resent. |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
Great :) Thanks. ChrisD |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
That was staff's claim, Dave, but not all are healthy: <stderr_txt> "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 15 May 09 Posts: 4556 Credit: 19,039,635 RAC: 18,944 |
"That was staff's claim, Dave, but not all are healthy:" Shucks. My suspicion is that there are two different problems around and only one of them has been fixed. Edit: I see from my email that the problem was a combination of files and settings that had each worked individually before, but were not compatible with one another. The same email says that none of the re-submitted ones have failed three times yet... |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
At half past four this morning, my internet went down. When I checked up on my crunchers this morning at 8, I found that my i7 cruncher had trashed all 7 running CPDN tasks plus the 7 tasks that I had waiting in the queue. The BOINC Event Log was not long enough to tell me what had happened. All tasks seem to have died due to No Heartbeat..... My Xeon cruncher has survived and can tell me a little more. (Maybe because it has only 6 tasks running?) When the internet went down, CPDN tasks could no longer upload their trickles. When this happens, they keep trying every 2 minutes.. 7 CPDN tasks that cannot upload their trickles seem to have trashed BOINC, and it lost control completely. This is a devastating blow, and I will withdraw this machine from CPDN until I have found out what really happened.. Any help would be appreciated. Thanks ChrisD |
Send message Joined: 15 May 09 Posts: 4556 Credit: 19,039,635 RAC: 18,944 |
I have started turning off internet access in BOINC Manager by default while crunching, as it makes it easier for me to monitor what is happening (size of uploads etc.). I did this primarily because this information was useful to the beta site people, but that site is currently not running. I have not heard of repeated attempts to access the internet crashing tasks before. I am also new to running Windows tasks, running 4 using Wine for the first time at the moment. Edit: I see that some of your failures are resends that had already failed on another machine, and one has failed on two others. I wouldn't be in too much of a hurry to write off your i7 till more information comes in from other crunchers. |
Send message Joined: 22 Feb 06 Posts: 493 Credit: 31,669,049 RAC: 10,904 |
You might find a longer event log in the BOINC folder in ProgramData. It's called stdoutdae and is a text file. Don't forget the ProgramData folder is hidden by default. |
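Since that log can run to many thousands of lines, a small filter helps to find the relevant part. Here is a hedged Python sketch: the path `C:\ProgramData\BOINC\stdoutdae.txt` is assumed to be the default data directory on a standard Windows install (yours may differ), and the search terms and function names are purely illustrative.

```python
from pathlib import Path

# Assumed default location on Windows; adjust if BOINC's data directory was moved.
DEFAULT_LOG = Path(r"C:\ProgramData\BOINC\stdoutdae.txt")
# Illustrative search terms, matched case-insensitively.
KEYWORDS = ("heartbeat", "trickle", "upload")


def matching_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) for every line containing any keyword."""
    lowered = tuple(k.lower() for k in keywords)
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            hits.append((number, text))
    return hits


def scan_log(path=DEFAULT_LOG, keywords=KEYWORDS):
    """Open the log file and filter it; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return matching_lines(handle, keywords)
```

Calling `scan_log()` and printing the result lists each matching line with its line number, so you can jump to that spot in the file and read the surrounding messages.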
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
"You might find a longer event file in the BOINC folder in program data. Its called stdoutdae and is a text file. Don't forget the ProgramData folder is hidden by default." Tnx :) Found it, and it told me the same sad story. After the internet went down, BOINC kept trying every minute to upload trickles to CPDN. Finally it simply lost patience and started trashing each and every CPDN task in an infernal rage. The log shows 'Reporting xx completed Tasks' repeatedly, together with result .zip files, none of which ever made it out from here. Strangely enough, SETI Beta, which runs on my GPU, survived. A lot of results were waiting to be uploaded, but when the internet was back the result queue cleared. Seems that BOINC/CPDN combined does not tolerate a bad internet connection. :( ChrisD All tasks were trashed with the excuse "no heartbeat". Maybe someone should make this service a bit more robust?? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I think that the "No heartbeat" stuff is when your computer gets very busy, and BOINC "can't get a word in edgewise". So it gives up waiting, and aborts things. Perhaps an antivirus is grabbing every file before it's allowed to run so that it can be checked? |
Send message Joined: 8 Aug 04 Posts: 69 Credit: 1,561,341 RAC: 0 |
Sorry, but my Xeon cruncher just trashed a task 5 trickles down the list. BUT: this machine has just a vanilla Windows 7 install. Nothing besides BOINC and BOINC-Tasks is installed. Still this heartbeat error trashes my CPDN tasks at random. Am I really the only one fighting these errors? If I am and nothing is done about it, I am not sure I can afford to compute for the trashcan much longer.. Sorry. ChrisD |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
About 12 hours ago, my Haswell, running Linux with a Windows XP clone running under Wine, completed 4 ANZ models with no drama. Also, I don't run them at maximum, so as to give the OS some processors to use if it wants to do something. In this case, it's only the real processors and not the Hyper-Threaded ones as well. The lack of anyone else posting about problems with them could be due to them being "set and forget", or because no one else is having a problem. One would need access to the database to see. But it does look as though you're the only one. |
©2025 cpdn.org