Message boards : Number crunching : Compute Errors on Pacific North West v7.22 Tasks
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
Multiple hosts grinding out nothing but errors with the latest PNW tasks... :-(
Joined: 15 May 09 Posts: 4538 Credit: 19,006,502 RAC: 21,456
The link http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?userid=520217&offset=0&show_names=0&state=5 gives me a "no access" message.
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
My apologies, I thought that was a freely available link. How about links to these tasks? These are typical of what I'm seeing. Wingmen are failing, too.
Workunit 9031051
Workunit 9031079
Workunit 9029889
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
Thanks, ritterm. Reported to project as presumed configuration error.
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Ritterm, the link that you quoted is to a site that used to provide the raw data for climate researchers. It needed a separate login, and it hasn't worked for over a year now. They were rewriting that site, but nothing has been heard of it since.
***************
As for the problem that you mention: the first two workunits you list have the old INITTIME error. Someone's made a boo-boo with one of the many files. The third link is a different problem, but it's the same problem on both computers that failed it.
If you're selective with your choice of model type and stick with the EU lot, you shouldn't have any problems.
<Sigh> Another email, another long wait while the world turns a bit.
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
Happy to know that it's not just a problem I'm having, but sorry to see that it's a problem with the work. I was looking forward to running a regional model that I don't have a lot of time on. Oh, well, plenty of EUs to go around... :-)
Cheers,
MarkR
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
Is the entire batch of current PNW work bad? Is anybody running good tasks right now?
Joined: 15 Dec 05 Posts: 4 Credit: 24,409,993 RAC: 37,586
All my PNWs are erroring out after about 15 seconds, so it's not an isolated incident. This is on Windows XP Professional 64-bit and Windows 7 Professional 64-bit. Damn! Other WUs seem to be fine.
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Bill, that's because of the INITTIME error, as mentioned a few posts up.
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649
Yup. It seems that some users (meaning researchers, who should know better, and who use our compute time) sometimes make mistakes submitting models for us to crunch. The whole damn batch is misconfigured. Again.
I'm hoping that the way, way upstream... Excuse me, but sometimes I think some of the upstream "researcher" clowns have no clue. And yes, I know that submitting to this CPDN site is a privilege for so-called researchers. Some of the "researcher" sites that try to use this site are totally unreliable clots. From time to time they submit bunches of misconfigured, totally slob crap.
I only hope that the reliable academic supporters of this site give real hard shit to the clowns who use this site and then submit a few thousand misconfigured, blunder-buggered, not-to-spec models that all break and waste contributors' time.
I think that the academic supporters of this project should give the academic submitters of broken, misconfigured models an ultimatum: get your params right, now! Your crunchers are getting annoyed at having incompetent slop crap thrown at them. It makes the whole process look crappy.
And yes, the INITTIME error is only the latest example.
Joined: 5 Aug 04 Posts: 126 Credit: 24,435,960 RAC: 23,907
All PNW models now crapping out after 30 seconds or so with an INITTIME error is a huge improvement over the previous batches... since those ran through 100 restarts due to "no heartbeat" before crapping out, and as a "bonus" left behind around 300 MB of garbage on the hard disk.
Frankly, AFAIK PNW hasn't worked since the upgrade to 7.22, a version that AFAIK wasn't even beta-tested before release, so I've no idea why CPDN keeps releasing new PNW garbage before they've even tried to get it working in beta.
Joined: 15 May 09 Posts: 4538 Credit: 19,006,502 RAC: 21,456
Last PNW to come to my machine was on 12th Feb this year. It completed.
Joined: 27 Jan 05 Posts: 16 Credit: 790,158 RAC: 0
Yeah, down with slop crap.
Joined: 5 Aug 04 Posts: 126 Credit: 24,435,960 RAC: 23,907
"Last PNW to come to my machine was on 12th Feb this year. It completed."
OK, I forgot to specify that it's all the Windows PNW tasks crapping out. Under other OSes like Linux, this batch is possibly worse, since this time it's an input-file error, whereas I'm not sure of the source of the error for the "no heartbeat" tasks.
Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860
"Last PNW to come to my machine was on 12th Feb this year. It completed."
That has been traced to a BOINC API bug, which, coincidentally, resurfaced today when an application developer from another project tripped over it. The bug doesn't affect all Windows machines: it only bites when BOINC v7 (and perhaps some of the very late BOINC v6.12.xx releases) is installed 'as a service'.
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
"Thanks, ritterm. Reported to project as presumed configuration error."
...and now reported to the originating scientist. There is a shared interest here: volunteers want to run good models, and so do the scientists, so this is the sort of error that gets sorted out.
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
Is there any risk in continuing to work these jobs and accumulating a pile of compute errors? Does the project ever blacklist or withhold work from hosts that appear to be unreliable?
I've been continuing to poll for these so I don't miss the return of what I'm hoping will be error-free tasks. Considering the project's one-hour backoff and the near-immediate failure of the tasks, it doesn't seem to me to be wasting a great deal of time and resources. I don't know, maybe I'm crazy... :D
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192
If you have a limit on your Internet connection, you may be eating into your allowance to no great purpose, and at a rather higher rate than would apply if the tasks were valid. Otherwise, there's no harm.
There is no concept, on this project, of an automatically detected unreliable computer (or indeed a reliable one). There is the 'minussing' procedure, but that's manual: a badly behaved computer has its task download limit set by hand to -1 after being reported here on the message boards.
Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0
Progress, perhaps? I'm only a few minutes into these new tasks that were created earlier today, but at least they didn't crash right away.
Workunit 9031281
Workunit 9031276
©2024 cpdn.org