climateprediction.net (CPDN) home page
Thread 'How normal is it for work units to abort?'

Grahamt

Joined: 23 Jul 10
Posts: 9
Credit: 2,099,795
RAC: 0
Message 40356 - Posted: 12 Aug 2010, 15:13:26 UTC

Hi,

I'm new to climateprediction.net, so apologies for ignorance. I've had four work units so far and all of them have finished early. The first one got about a third of the way through, but the rest have ended quite soon after they've begun. Here's an excerpt from my message log, relating to two work units (I've highlighted lines that look significant):

12/08/2010 15:12:24 Resuming computation
12/08/2010 15:13:02 climateprediction.net Computation for task famous_uawz_1899_200_006646990_4 finished
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_2.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_3.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_4.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_5.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_6.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_7.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_8.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_9.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_10.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_11.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_12.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_13.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_14.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_15.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_16.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_17.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_18.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_19.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:02 climateprediction.net Output file famous_uawz_1899_200_006646990_4_20.zip for task famous_uawz_1899_200_006646990_4 absent
12/08/2010 15:13:25 Suspending computation - CPU usage is too high
12/08/2010 15:13:35 Resuming computation
12/08/2010 15:14:05 climateprediction.net Sending scheduler request: To fetch work.
12/08/2010 15:14:05 climateprediction.net Reporting 1 completed tasks, requesting new tasks
12/08/2010 15:14:07 climateprediction.net Scheduler request completed: got 1 new tasks
12/08/2010 15:14:09 climateprediction.net Started download of famous_ubtn_1199_200_006648166.zip
12/08/2010 15:14:09 climateprediction.net Started download of dump_r3x3_20a_1199.gz
12/08/2010 15:14:10 climateprediction.net Finished download of famous_ubtn_1199_200_006648166.zip
12/08/2010 15:14:10 climateprediction.net Started download of dump_r3x3_20o_1199.gz
12/08/2010 15:14:13 climateprediction.net Finished download of dump_r3x3_20a_1199.gz
12/08/2010 15:14:17 climateprediction.net Finished download of dump_r3x3_20o_1199.gz
12/08/2010 15:14:17 climateprediction.net Starting famous_ubtn_1199_200_006648166_4
12/08/2010 15:14:17 climateprediction.net Starting task famous_ubtn_1199_200_006648166_4 using famous version 611
12/08/2010 15:16:49 climateprediction.net Computation for task famous_ubtn_1199_200_006648166_4 finished

12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_1.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_2.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_3.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_4.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_5.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_6.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_7.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_8.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_9.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_10.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_11.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_12.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_13.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_14.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_15.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_16.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_17.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_18.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_19.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_20.zip for task famous_ubtn_1199_200_006648166_4 absent
12/08/2010 15:17:52 climateprediction.net Sending scheduler request: To fetch work.
12/08/2010 15:17:52 climateprediction.net Reporting 1 completed tasks, requesting new tasks
12/08/2010 15:19:03 climateprediction.net Scheduler request failed: HTTP gateway timeout
12/08/2010 15:20:03 climateprediction.net Sending scheduler request: To fetch work.
12/08/2010 15:20:03 climateprediction.net Reporting 1 completed tasks, requesting new tasks
12/08/2010 15:20:04 climateprediction.net Scheduler request completed: got 0 new tasks
12/08/2010 15:20:04 climateprediction.net Message from server: Server can't open database
12/08/2010 15:50:33 climateprediction.net update requested by user
12/08/2010 15:50:35 climateprediction.net Sending scheduler request: Requested by user.
12/08/2010 15:50:35 climateprediction.net Reporting 1 completed tasks, requesting new tasks
12/08/2010 15:50:37 climateprediction.net Scheduler request completed: got 0 new tasks
12/08/2010 15:50:37 climateprediction.net Message from server: Completed result famous_ubtn_1199_200_006648166_4 refused: result already reported as error
12/08/2010 15:50:37 climateprediction.net Message from server: No work sent
12/08/2010 15:50:37 climateprediction.net Message from server: (reached daily quota of 1 tasks)
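For anyone who wants to summarise a long log like the one above rather than read it line by line, here is a rough Python sketch. It assumes only the line shapes visible in this excerpt ("Computation for task ... finished" and "Output file ... for task ... absent"); the function name `summarize_log` and the two pattern names are made up for illustration and are not part of BOINC.

```python
import re
from collections import Counter

# Patterns inferred from the excerpt above; other client versions may word these differently.
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent$")
FINISHED_RE = re.compile(r"Computation for task (\S+) finished$")


def summarize_log(lines):
    """Return (finished_tasks, absent_counts) for an iterable of log lines."""
    finished = []       # task names in the order they finished
    absent = Counter()  # task name -> number of output files reported absent
    for line in lines:
        line = line.strip()
        m = FINISHED_RE.search(line)
        if m:
            finished.append(m.group(1))
            continue
        m = ABSENT_RE.search(line)
        if m:
            absent[m.group(2)] += 1
    return finished, absent


# Four lines copied from the excerpt above.
sample = [
    "12/08/2010 15:14:17 climateprediction.net Starting task famous_ubtn_1199_200_006648166_4 using famous version 611",
    "12/08/2010 15:16:49 climateprediction.net Computation for task famous_ubtn_1199_200_006648166_4 finished",
    "12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_1.zip for task famous_ubtn_1199_200_006648166_4 absent",
    "12/08/2010 15:16:49 climateprediction.net Output file famous_ubtn_1199_200_006648166_4_2.zip for task famous_ubtn_1199_200_006648166_4 absent",
]

finished, absent = summarize_log(sample)
for task in finished:
    print(task, "finished with", absent[task], "output files absent")
```

A task that "finished" within minutes and has its output files reported absent is probably one that ended early, but the log alone does not say why; the task's result page is the place to look for the cause.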


I'm wondering if there's anything about my setup that isn't suitable for this project. My PC is fairly old now - bought in 2005. It uses an Athlon 64 processor, with 2GB RAM. I don't have any problems with another BOINC project, SETI@home, but the work units here are much bigger - nominal completion time is 265 hours, but I think it would take around 400 hours in practice.

Is there any point in downloading work units that never complete, or is it normal for some models to finish early?

Graham
Grahamt

Joined: 23 Jul 10
Posts: 9
Credit: 2,099,795
RAC: 0
Message 40357 - Posted: 12 Aug 2010, 15:36:04 UTC - in response to Message 40356.  

Apologies - I've now read the 'Famous success / failure ratio' thread started by Jim. I can see that it's not unusual for WUs to crash.

However, I guess my question remains. Given my somewhat antiquated setup (5-year-old Athlon 64 3200 processor, 2GB RAM, running Windows XP), is it worth continuing to contribute to climateprediction.net?

Graham
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 40358 - Posted: 12 Aug 2010, 15:49:30 UTC - in response to Message 40357.  

If other projects are crunching fine, then I'd say keep running CPDN. It is good to volunteer what you have, and each of us does the same.

Common issues with older computers include a corrupt operating system (e.g., Windows), failing hard drives, deteriorating power supplies, aging motherboard capacitors (causing voltage drops), and memory errors. If you can verify these are not problems, then you're good to crunch.
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,856,833
RAC: 4,824
Message 40360 - Posted: 12 Aug 2010, 18:49:14 UTC
Last modified: 14 Aug 2010, 9:41:40 UTC

Graham,

The main thing to check is whether the models are crashing with a physics-related error message or with something that suggests the type of thing in DJStarfox's list.

If you look at a task result page - e.g. here - and expand the Stderr field and see the model ending ...

Model crashed: ATM_DYN : INVALID THETA DETECTED.

or

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED.

... then that's an unviable model. The FAMOUS models are very prone to this and it's expected that many of them will fail.

Three out of five of your models have failed that way. The other two failed with a -185 error, which needs looking into.
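As a rough illustration of that check, here is a minimal Python sketch. The function name `classify_stderr` and the two result labels are hypothetical, and the only markers it knows are the two messages quoted above; anything else is simply flagged for a closer look.

```python
import re

# Physics-related crash messages quoted in this thread.
PHYSICS_MARKERS = ("INVALID THETA", "NEGATIVE PRESSURE")


def classify_stderr(stderr_text):
    """Return 'unviable model' for a physics crash, else 'needs investigation'."""
    # Collapse runs of whitespace so text copied with odd spacing still matches.
    normalized = re.sub(r"\s+", " ", stderr_text).upper()
    if any(marker in normalized for marker in PHYSICS_MARKERS):
        return "unviable model"
    return "needs investigation"


print(classify_stderr("Model crashed: ATM_DYN : INVALID THETA DETECTED."))
print(classify_stderr("Model crashed: P_TH_ADJ : NEGATIVE   PRESSURE VALUE CREATED."))
# Placeholder string for illustration only, not real Stderr output.
print(classify_stderr("some other failure"))
```

The sketch deliberately does not try to interpret numeric exit codes such as -185; those still need a person to look at the machine.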

Iain
old_user92639

Joined: 13 Aug 05
Posts: 54
Credit: 117,227
RAC: 0
Message 40362 - Posted: 12 Aug 2010, 22:47:29 UTC - in response to Message 40360.  

Graham,

The main thing to check is whether the models are crashing with a physics-related error message or with something that suggests the type of thing in DJStarfox's list.

If you look at a task result page - e.g. here - and expand the Stderr field and see the model ending:

Model crashed: ATM_DYN : INVALID THETA DETECTED.

or 'negative pressure' then that's an unviable model. The FAMOUS models are very prone to this and it's expected that many of them will fail.

Three out of five of your models have failed that way. The other two failed with a -185 error, which needs looking into.

Iain

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 40371 - Posted: 14 Aug 2010, 20:48:55 UTC

The part highlighted in red is a standard part of the diagnostic message and doesn't help. (It's off the page and not seen unless you copy the entire line, in which case all but one of the spaces are eliminated.) I simply grab the "INVALID THETA" or "NEGATIVE PRESSURE" part because the rest of the line is standard. Either way, those two messages mean an unstable parameter set caused the crash. (The scientists can't be sure in advance which combinations are unstable, and that's part of what they want to know from these Tasks.)
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
old_user46232

Joined: 29 Jan 05
Posts: 7
Credit: 784,071
RAC: 0
Message 40380 - Posted: 18 Aug 2010, 14:35:32 UTC
Last modified: 18 Aug 2010, 14:36:19 UTC

I came back to climateprediction about two months ago, after a long leave, and naturally also ran into these errors. I quickly found out that this is normal.
What I don't really understand, though, is this: if "INVALID THETA" or "NEGATIVE PRESSURE" are problems with the model, I would expect the models to fail at the same point, and therefore to claim approximately the same credit, on all machines. I noticed, however, that the point at which models fail with this message varies widely, and that some models even complete for some people. Can any of you explain that to me?
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 40381 - Posted: 18 Aug 2010, 16:11:50 UTC - in response to Message 40380.  

Because not all hosts are created equal. At least that is how I understand it. There are differences in the type of processor used (AMD/Intel). As if that were not enough, the type of operating system (Windows/Darwin/Linux) also makes a difference.
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,856,833
RAC: 4,824
Message 40383 - Posted: 18 Aug 2010, 22:32:00 UTC

Yes, that's right. It is usually the case that models run on the same combination of operating system and processor will all succeed or all fail at the same point. Of course models also fail because there's a specific problem on a computer - e.g. permissions, hardware etc. - and that type of error obviously won't be reproduced.
tullio

Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40384 - Posted: 19 Aug 2010, 1:59:53 UTC - in response to Message 40383.  

I have been running a HADCM3 model since June 17 2009 on my Linux box, with an AMD Opteron 1210 and SuSE Linux 11.1. All three of my wingmen failed on June 18, all of them with Intel processors. One used Linux kernel 2.6.30, newer than my 2.6.27; the others used Darwin 10.2 and 10.4. From this one could think that the crucial factor is the processor's make. Mine has two cores and runs at 1.8 GHz, not overclocked (I never overclock a CPU, as this is a frequent cause of errors).
Tullio
old_user46232

Joined: 29 Jan 05
Posts: 7
Credit: 784,071
RAC: 0
Message 40386 - Posted: 19 Aug 2010, 9:27:49 UTC

So then the results of a completed run would also differ between different operating systems and processor makes?
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,856,833
RAC: 4,824
Message 40387 - Posted: 19 Aug 2010, 9:53:04 UTC - in response to Message 40386.  

So then the results of a completed run would also differ between different operating systems and processor makes?

Yes. The variations are equivalent to changes in initial conditions according to a paper referred to on the publications page - scan down the page for "Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models".
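To get a feel for why a tiny platform difference can behave like a change in initial conditions, here is a toy Python sketch. It is not the climate model and says nothing about what the paper measured: it uses the logistic map, a textbook chaotic system chosen here purely as a stand-in, and the starting value, the 1e-12 offset and both function names are assumptions made for illustration.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), a standard toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def first_divergence(a, b, tolerance=1e-3):
    """Return the first step where two trajectories differ by more than tolerance, or None."""
    for step, (xa, xb) in enumerate(zip(a, b)):
        if abs(xa - xb) > tolerance:
            return step
    return None


base = logistic_trajectory(0.2, 200)
# A tiny offset, standing in for a small numerical difference between platforms.
perturbed = logistic_trajectory(0.2 + 1e-12, 200)
print("trajectories separate at step", first_divergence(base, perturbed))
```

Two runs that start almost identically stay together for a while and then part company completely, which is one plausible way to picture models with the same parameters crashing at different points, or not at all, on different machines.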