climateprediction.net (CPDN) home page
Thread 'Compute Errors / Bad Work Units?'

ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 46831 - Posted: 21 Aug 2013, 20:55:37 UTC

It looks like there are some recently generated WUs out there that are bad:

8559375
8559214

Most hosts are crashing shortly after the start with similar stderr output:

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch 2048
.
.
.
Sorry, too many model crashes! :-(
Called boinc_finish

</stderr_txt>
]]>
ID: 46831
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46835 - Posted: 21 Aug 2013, 21:33:35 UTC - in response to Message 46831.  

Oh great. OK, another email. :(
Thanks for the heads up.

ID: 46835
MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46902 - Posted: 28 Aug 2013, 4:12:39 UTC - in response to Message 46835.  

Same with me on:

WU 8559724 Task 15941889

It ran a total of 22 secs and other people's tasks also failed: 3 immediately, one after 30 mins.

Regards,
Martin

ID: 46902
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 46930 - Posted: 1 Sep 2013, 13:13:58 UTC

...my machine made it to 96% on work unit 8537815 before keeling over with an exit code of 22...i think it was perhaps OS confusion caused by other things that were running, but don't know for sure...other machines gave up much sooner...

while looking at other folks' progress on this work unit, i noticed that machine 1105670 has crashed on nearly every work unit it has attempted...any obvious reason for this situation ???

frank
ID: 46930
Bellator
Joined: 31 Mar 05
Posts: 44
Credit: 234,235
RAC: 0
Message 46931 - Posted: 1 Sep 2013, 15:07:58 UTC - in response to Message 46930.  

For what it's worth: 8549908 has been running for well over 400 hours and I have at least 20 trickles. Just lucky, I guess...
ID: 46931
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 46933 - Posted: 1 Sep 2013, 16:17:11 UTC

My best guess is that machine 1105670 needs to reset the project.
ID: 46933
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 46934 - Posted: 1 Sep 2013, 18:54:12 UTC
Last modified: 1 Sep 2013, 18:55:19 UTC

Old Athlon XP, not compatible with HadCM3N tasks.

Thanks for pointing it out! I'll notify Andy so he can cut the machine's water off.

[Edited for typo.]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 46934
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 46935 - Posted: 1 Sep 2013, 19:15:23 UTC

These might be "typical" model crashes, but the stderr output is different than I've seen in my limited CPDN experience. Just pointing them out in case they are new/relevant.

Task 15942401 -- Model crashed: ATM_DYN : INVALID THETA DETECTED.

Task 15930887 -- Model crashed: INITTIME: Atmosphere basis time mismatch

Task 15899492 -- terminate called after throwing an instance of 'St9bad_alloc'


ID: 46935
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46936 - Posted: 1 Sep 2013, 20:21:46 UTC - in response to Message 46935.  

INVALID THETA
A standard message. Means that the physical environment became impossible.
(The gravity suddenly disappeared, and all of the air floated into space.)
(The temperature suddenly rose to a million degrees, and everything was cooked to a crisp.)
(etc.)
This is one of the things that the researchers are looking for: "For how long will this model run, given the starting values used?"

Atmosphere basis time mismatch
Another standard message.
Oops. The research assistant specified the wrong number of values somewhere.

Alloc and Malloc.
Memory allocation errors. This has shown up a few times, but I don't think that the problem was identified.
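
For anyone sorting through their own failed tasks, the three messages above can be told apart mechanically. What follows is only a minimal sketch in Python, assuming the task's stderr text has been pasted into a string. The helper name `classify_stderr` and the wording of the labels are made up for illustration; the substrings are simply the ones quoted in this thread, not an official list.

```python
# Hypothetical helper: map the stderr snippets quoted in this thread to the
# rough explanations given above. Not part of BOINC or the CPDN applications.
KNOWN_MESSAGES = [
    ("INVALID THETA", "model physics became impossible (not a host problem)"),
    ("Atmosphere basis time mismatch", "bad work unit setup (not a host problem)"),
    ("bad_alloc", "memory allocation error (cause not identified)"),
]


def classify_stderr(stderr_text):
    """Return the explanation for every known message found in stderr_text."""
    found = []
    for needle, explanation in KNOWN_MESSAGES:
        if needle in stderr_text:
            found.append(explanation)
    # Fall back to a generic hint when nothing in the table matches.
    return found or ["unrecognised; post the stderr on the boards"]


if __name__ == "__main__":
    sample = "Model crashed: INITTIME: Atmosphere basis time mismatch 2048"
    for line in classify_stderr(sample):
        print(line)
```

Plain substring matching is used on purpose: the messages vary slightly between tasks (one copy ends in "2048", another does not), so an exact comparison would miss some of them.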


ID: 46936
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 46937 - Posted: 2 Sep 2013, 3:10:06 UTC

Thanks for the information, Les. Clearly my memory (and search skills) are lacking as I didn't realize the "time mismatch" error was the one I noted at the beginning of the thread! :D
ID: 46937
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 46977 - Posted: 6 Sep 2013, 8:32:32 UTC

astroWX:

machine 1184413 seems to also be chewing up work units without any useful results...

frank
ID: 46977
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 46978 - Posted: 6 Sep 2013, 13:22:58 UTC

As do 982003 and 1186330, two of my wingmen on one of my two tasks, and this one, 1234665, hasn't completed anything in the past year.

I know this computer's record is not exemplary but it seems there are a lot of boxes out there that never seem to complete tasks.......
ID: 46978
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 46981 - Posted: 6 Sep 2013, 18:50:12 UTC
Last modified: 6 Sep 2013, 19:13:55 UTC

Thanks for those, Frank and Dave. Message sent to Andy.

Given that it is ~1945 BST, it is unlikely that anything will be done until Monday.

[EDIT: What Andy does with your information: Sets max. tasks per day to -1 for the machines, which prevents more work being sent. An email is then sent to the owner stating what was done and requesting that they fix the problem, then post on these boards to tell us the fix was made, or to ask questions. When fixed, Andy resets the machine's account to allow more work. Unfortunately, not many of the people post, as compared to the number of emails sent.]

[EDIT2: Neglected to mention: '1234665' had two successes of nine tasks. 'Anonymous' seems to have given-up in February, so nothing done on that one.]

Copy:
Hi, Andy,

Three more brought to our attention by participants:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1186330 *
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=982003 **
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1184413 ***

* Mix of troubles; completes HadAM3P tasks
** Some success in the past but all now have SIGSEGV errors.
*** Darwin issue
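
The spotting of such machines is being done by eye in this thread. Purely as an illustration, the same check could be sketched in Python as below. The `(host_id, succeeded)` pairs are an assumed input format, not something the CPDN site exports, and the thresholds `min_tasks` and `max_success_rate` are arbitrary values chosen for the example.

```python
from collections import defaultdict


def flag_failing_hosts(results, min_tasks=5, max_success_rate=0.25):
    """Return host ids whose success rate is at or below max_success_rate.

    results: iterable of (host_id, succeeded) pairs (an assumed format).
    Hosts with fewer than min_tasks results are skipped as too few to judge.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for host_id, succeeded in results:
        totals[host_id] += 1
        if succeeded:
            successes[host_id] += 1
    flagged = []
    for host_id in sorted(totals):
        if totals[host_id] < min_tasks:
            continue
        if successes[host_id] / totals[host_id] <= max_success_rate:
            flagged.append(host_id)
    return flagged


if __name__ == "__main__":
    # Made-up data: host 1 succeeds 2 times out of 9, host 2 succeeds 4 of 5.
    demo = [(1, i < 2) for i in range(9)] + [(2, i < 4) for i in range(5)]
    print(flag_failing_hosts(demo))
```

A low success rate alone does not say why a host fails, so a flagged machine still needs someone to look at its results before it is reported.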

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 46981
KWSN - Sir Frank of the Wood
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 47180 - Posted: 27 Sep 2013, 1:26:39 UTC

astroWX:

machine 1270234 seems to be spinning its wheels also...

frank
ID: 47180
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 47181 - Posted: 27 Sep 2013, 5:36:32 UTC - in response to Message 47180.  

Good catch, Frank. Thanks. It's reported to Andy at the head-shed.


There is now a thread for folks willing to dig-out and report ne'er-do-well machines (so we can keep them together).
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7674

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 47181
old_user101069
Joined: 4 Oct 05
Posts: 12
Credit: 610,967
RAC: 0
Message 47209 - Posted: 30 Sep 2013, 17:33:15 UTC
Last modified: 30 Sep 2013, 17:35:40 UTC

My computer, 1281635, has not completed a cpdn workunit in recent memory. Failure is wildly different per unit, and I have not been able to identify a pattern. I've always assumed that the work completed was still valuable, so I let it keep chugging away. Way back when I was a student, I used to back up my tasks, and restore them after an error, but I just don't have time to manage that anymore. If someone wants to take a look and tell me if I should just abandon the project, I'd be interested to hear it. (This is something that only affects this project. My pc is otherwise stable and goes for weeks without reboot.)


Thanks!

(edit: heh, I guess the second-most-recent task I got completed successfully. I just assumed it didn't since it was so short. nevertheless, it's still true that the vast majority of my tasks fail.)
ID: 47209
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 47212 - Posted: 30 Sep 2013, 18:26:06 UTC - in response to Message 47209.  
Last modified: 30 Sep 2013, 18:30:06 UTC

My computer, 1281635, has not completed a cpdn workunit in recent memory. ...

Refreshing your memory ... the project has been happy with this one :-)

p.s.: CPDN results are sometimes hard stuff, so a (basically) reliable host that does not trash workunit after workunit within a really short time should always have a chance to complete some results. If I were you, I would keep trying (I trashed way more than a handful too btw.).
ID: 47212
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 47213 - Posted: 30 Sep 2013, 18:29:25 UTC

uioped1 -

Until someone smarter than me tells you different -

The work you are doing and sending (each trickle) is valuable even if the model doesn't run all the way to completion.

One of your results I looked at had a "bad theta" which is not your problem.

I do notice that your other results get suspended a lot. There might be some parameter you have set in BOINC that is causing this - for example, if you have CPU usage set to less than 100%. These models don't like to be interrupted even though technically that should be OK.

Make sure you have "leave applications in memory when suspended" OFF.

If you absolutely have to shut BOINC down or turn off your computer, go to the Projects tab and suspend "ClimatePrediction". This will save everything to disk. Then shut BOINC down. Then turn your computer off.
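
The same order of steps can also be scripted with the `boinccmd` command-line tool that ships with BOINC, rather than clicking through the Manager. This is a rough sketch only: the project URL is an assumption (use whatever your own client lists for the project), the 30-second pause is an arbitrary guess at "long enough", and the script defaults to a dry run that just prints the commands.

```python
import subprocess
import time

# Assumption: the URL your client lists for the project; check the Projects tab.
PROJECT_URL = "http://climateprediction.net/"


def shutdown_commands(project_url):
    """Build the boinccmd calls: suspend the project first, then quit the client."""
    return [
        ["boinccmd", "--project", project_url, "suspend"],
        ["boinccmd", "--quit"],
    ]


def run_shutdown(project_url, wait_seconds=30, dry_run=True):
    """Suspend, pause so the models can write to disk, then stop BOINC."""
    suspend_cmd, quit_cmd = shutdown_commands(project_url)
    for cmd, pause in ((suspend_cmd, wait_seconds), (quit_cmd, 0)):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
            time.sleep(pause)


if __name__ == "__main__":
    run_shutdown(PROJECT_URL)  # dry run: only prints the commands
```

Turning the computer off is deliberately left out; that step stays manual, after BOINC has exited.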


ID: 47213
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 47214 - Posted: 30 Sep 2013, 18:33:22 UTC - in response to Message 47213.  

...
Make sure you have "leave applications in memory when suspended" OFF.
...

Not in all cases - a box with 10GB physical RAM that runs 24/7 can afford to leave the stuff in memory, and a restart from checkpoint is always more risky than just restarting a suspended task.
ID: 47214
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47215 - Posted: 30 Sep 2013, 18:51:07 UTC



I personally would use the following settings:

* Suspend work if CPU usage above %
0 (i.e., do not suspend)


* Leave tasks in memory while suspended?
Yes


* Suspend work while computer is in use?
No
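
For those who prefer a local override file to the web preferences, the three settings above could be written out roughly as follows. Treat this as a sketch under assumptions: the tag names are as I understand BOINC's `global_prefs_override.xml` and should be checked against the client documentation, and the values simply mirror the three choices listed above (note that "leave tasks in memory" is a matter of opinion in this thread).

```python
import xml.etree.ElementTree as ET

# Tag names as I understand BOINC's global_prefs_override.xml; verify
# against your client's documentation before using.
SETTINGS = {
    "suspend_cpu_usage": "0",      # 0 = never suspend because of other CPU load
    "leave_apps_in_memory": "1",   # keep suspended tasks in RAM
    "run_if_user_active": "1",     # keep computing while the PC is in use
}


def build_override(settings):
    """Return the override file contents as an XML string."""
    root = ET.Element("global_preferences")
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override(SETTINGS))
```

The script only prints the XML; saving it into the BOINC data directory and telling the client to re-read its preferences is left to the reader.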



As WB8ILI says, when you are shutting down your PC, first suspend BOINC, wait a few moments, then shut down BOINC. This gives the models a chance to shut down cleanly rather than being killed by Windows during the shutdown process. Similarly, if you are about to do something intensive on the PC (for example, gaming), then it is a good idea to shut down BOINC then also.
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 47215