climateprediction.net (CPDN) home page
Thread 'Compute Errors on Pacific North West v7.22 Tasks'

Message boards : Number crunching : Compute Errors on Pacific North West v7.22 Tasks

ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49573 - Posted: 16 Jul 2014, 20:56:28 UTC

Multiple hosts grinding out nothing but errors with the latest PNW tasks... :-(
ID: 49573
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 49574 - Posted: 16 Jul 2014, 21:07:19 UTC - in response to Message 49573.  

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?userid=520217&offset=0&show_names=0&state=5

gives message

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?userid=520217&offset=0&show_names=0&state=5


no access
ID: 49574
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49575 - Posted: 16 Jul 2014, 22:22:18 UTC

My apologies, I thought that was a freely available link. How about the links to these tasks? These are typical of what I'm seeing. Wingmen are failing, too.

Workunit 9031051
Workunit 9031079
Workunit 9029889
ID: 49575
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 49576 - Posted: 16 Jul 2014, 23:21:54 UTC

Thanks, ritterm. Reported to project as presumed configuration error.
ID: 49576
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49577 - Posted: 16 Jul 2014, 23:23:59 UTC - in response to Message 49573.  

Ritterm

The link that you quoted is the one to a site that used to be for the raw data for use by climate researchers. It needed to be separately logged into.
And it hasn't worked for over a year now. They were re-writing that site, but nothing has been heard of it.

***************

As for the problem that you mention, in the 1st 2 that you mention, it's the old INITTIME error.
Someone's made a boo-boo with one of the many files.

The 3rd link is a different problem, but the same for the 2 computers that failed it.

If you're selective with your choice of model type and stick with the EU lot, you shouldn't have any problems.

<Sigh>
Another email, another long wait while the world turns a bit.

ID: 49577
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49580 - Posted: 17 Jul 2014, 2:08:13 UTC

Happy to know that it's not just a problem I'm having, but sorry to see that it's a problem with the work...I was looking forward to running a regional model that I don't have a lot of time on. Oh, well, plenty of EUs to go around... :-)

Cheers,

MarkR
ID: 49580
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49585 - Posted: 17 Jul 2014, 22:55:21 UTC

Is the entire batch of current PNW work bad? Is anybody running good tasks right now?
ID: 49585
bill brandt-gasuen2
Joined: 15 Dec 05
Posts: 4
Credit: 24,411,649
RAC: 37,356
Message 49586 - Posted: 18 Jul 2014, 4:57:13 UTC

All my PNWs are erroring out after about 15 seconds, so it's not an isolated incident. This is on Win XP 64-bit Pro and Win7 64-bit Pro. Damn! Other WUs seem to be fine.
ID: 49586
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49587 - Posted: 18 Jul 2014, 6:59:26 UTC - in response to Message 49586.  

Bill

That's because of the INITTIME error, as mentioned a few posts down.

ID: 49587
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 49588 - Posted: 18 Jul 2014, 8:48:42 UTC
Last modified: 18 Jul 2014, 8:55:28 UTC

Yup, it seems that some users (meaning researchers who should know better, and who use our compute time) sometimes make mistakes submitting models for us to crunch...


The whole damn batch is misconfigured.

Again.

I'm hoping that the way, way upstream... (Excuse me, but sometimes I think some of the upstream "researcher" clowns have no clue. AND yes, I know that submitting to this CPDN site is a privilege for so-called researchers.)
Some of the "researcher" sites that try to use this site are totally unreliable clots.

They submit bunches of misconfigured and totally slob crap from time to time.

I only hope that the reliable academic supporters of this site give real hard shit to the clowns who try to use this site and then submit a few thousand misconfigured, blunder-buggered, not-spec broken models that all break and waste contributors' time.

I think that the academic supporters of this project should give the academic submitters of totally broken, misconfigured models an ultimatum.

Get your params right - now!

Your crunchers are getting annoyed at having incompetent slop crap thrown at them.

Makes the whole process look crappy.


And yes - the INITTIME error is only the latest example



ID: 49588
Ingleside
Joined: 5 Aug 04
Posts: 126
Credit: 24,435,960
RAC: 23,907
Message 49590 - Posted: 18 Jul 2014, 12:13:19 UTC - in response to Message 49587.  


That's because of the INITTIME error, as mentioned a few posts down.

All PNW-models now crapping-out after 30 seconds or something with an INITTIME-error is a huge improvement over the previous batches...

... since these ran-through 100 re-starts due to "no heartbeat" before crapping-out and as a "bonus" left-behind around 300 MB of garbage on the hd.

Frankly, AFAIK PNW hasn't worked since the upgrade to 7.22, a version AFAIK not even beta-tested before release, so I've no idea why CPDN continues releasing new PNW-garbage before they've even tried to get it working as beta.



ID: 49590
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 49591 - Posted: 18 Jul 2014, 13:46:50 UTC

Last PNW to come to my machine was on 12th Feb this year. It completed.
ID: 49591
Dave Worrall
Joined: 27 Jan 05
Posts: 16
Credit: 790,158
RAC: 0
Message 49592 - Posted: 18 Jul 2014, 15:23:44 UTC - in response to Message 49588.  

Yeah, down with slop crap.
ID: 49592
Ingleside
Joined: 5 Aug 04
Posts: 126
Credit: 24,435,960
RAC: 23,907
Message 49593 - Posted: 18 Jul 2014, 17:03:50 UTC - in response to Message 49591.  

Last PNW to come to my machine was on 12th Feb this year. It completed.

OK, I forgot to specify that it's all the Windows PNW tasks that are crapping out. Under a different OS like Linux, this batch is possibly worse, since this time it's an input-file error, while I'm not sure of the source of the error for the "no heartbeat" tasks.
ID: 49593
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49594 - Posted: 18 Jul 2014, 17:08:09 UTC

It looks like that batch has worked its way through the system, at least for the most part...

ID: 49594
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 49595 - Posted: 18 Jul 2014, 17:13:07 UTC - in response to Message 49593.  

Last PNW to come to my machine was on 12th Feb this year. It completed.

OK, I forgot to specify that it's all the Windows PNW tasks that are crapping out. Under a different OS like Linux, this batch is possibly worse, since this time it's an input-file error, while I'm not sure of the source of the error for the "no heartbeat" tasks.

That has been traced to a BOINC API bug, which - coincidentally - resurfaced today when an application developer from another project tripped over it.

The bug doesn't affect all Windows machines. It only bites when BOINC v7 (and perhaps some of the very late BOINC v6.12.xx line) is installed 'as a service'.
ID: 49595
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 49596 - Posted: 18 Jul 2014, 19:46:06 UTC - in response to Message 49576.  

Thanks, ritterm. Reported to project as presumed configuration error.

... and now reported to the originating scientist. There is a shared interest here, in that volunteers want to run good models and so do the scientists: so this is the sort of error that gets sorted out.
ID: 49596
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49597 - Posted: 19 Jul 2014, 14:38:09 UTC
Last modified: 19 Jul 2014, 14:42:05 UTC

Is there any risk in continuing to work these jobs and accumulate a pile of compute errors? Does the project ever blacklist or withhold work from hosts that appear to be unreliable?

I've been continuing to poll for these so I don't miss the return of what I'm hoping will be error-free tasks. Considering the project's one hour backoff and near-immediate failure of the tasks, it doesn't seem to me to be wasting a great deal of time and resources. I don't know, maybe I'm crazy... :D
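
A rough back-of-the-envelope check of that claim, sketched in Python. It assumes one failing task per poll, a full one-hour backoff after each failure, and the 15 to 30 second failure times reported earlier in this thread. The function name and parameters are made up for illustration and are not part of BOINC or CPDN.

```python
def wasted_fraction(failure_seconds: float,
                    backoff_seconds: float = 3600.0,
                    tasks_per_poll: int = 1) -> float:
    """Fraction of wall-clock time spent crunching tasks that then fail.

    Assumption (hypothetical model): each poll cycle is `tasks_per_poll`
    failing tasks run back to back, followed by one full backoff period.
    """
    busy = failure_seconds * tasks_per_poll
    return busy / (busy + backoff_seconds)


if __name__ == "__main__":
    # 15 s and 30 s are the failure times mentioned by other posters.
    for secs in (15, 30):
        share = wasted_fraction(secs)
        print(f"{secs} s per failure: {share:.2%} of wall-clock time")
```

Under those assumptions the share stays well under 1% either way, so CPU time is not the concern. The sketch says nothing about download traffic, which depends on file sizes I don't know.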
ID: 49597
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 49598 - Posted: 19 Jul 2014, 15:47:20 UTC

If you have a restriction on your Internet connection then you may be eating into your allowance to no great purpose and at a rather higher rate than would apply if the tasks were valid. Otherwise, there's no harm.

There is no concept, on this project, of an automatically-detected unreliable computer (or indeed a reliable one). There is the 'minussing' procedure, but that's manual: badly behaved computers have their task download limit set to -1 by hand as the result of a report here on the message boards.
ID: 49598
ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 49600 - Posted: 21 Jul 2014, 10:59:32 UTC

Progress, perhaps? Only a few minutes in on these new tasks that were created earlier today, but at least they didn't crash right away.

Workunit 9031281
Workunit 9031276

ID: 49600