climateprediction.net (CPDN) home page
Thread 'WUs constantly failing'

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19684 - Posted: 27 Jan 2006, 11:59:23 UTC - in response to Message 19683.  

> You could try removing BOINC from the startup folder, so that, when your brother turns on the computer to play games, BOINC doesn't start.
> When you want to run BOINC, start it manually by clicking on the boincmgr icon in the BOINC folder.

Yes, I know :)

But for now it's fine this way; I make a BOINC backup every morning ;)

I want to keep letting the model crash, so that maybe I can find something useful to help the programmers fix issues like this. Not starting BOINC, or making a backup, is a workable approach for experts, not for the normal user, especially when the workunits last several months :)

Now I have to find a way to prevent the cleanup after the model crashes, so that I can look for the error in yabds.out.

In the working model, yabds.out contains errors similar to those in db's post. If I remember, I'll copy them here tomorrow.
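
For anyone who wants to script the morning backup mentioned in this post, here is a minimal Python sketch. It only copies a folder into a new timestamped directory. The function name `backup_folder` is made up for illustration, and the location of the BOINC folder is an assumption that depends on the installation, so the demo runs on a throwaway directory instead. It is probably wise to stop BOINC before copying so that the files are in a consistent state.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_folder(src: Path, dest_root: Path) -> Path:
    """Copy src into a new timestamped folder under dest_root and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_root / f"{src.name}-{stamp}"
    # copytree creates the target (and any missing parents) and copies recursively.
    shutil.copytree(src, target)
    return target


if __name__ == "__main__":
    # Demonstration on a temporary directory; a real run would pass the
    # actual BOINC folder (hypothetical path, varies by install) as src.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "BOINC"
        (src / "projects").mkdir(parents=True)
        (src / "client_state.xml").write_text("<client_state/>")
        copy = backup_folder(src, Path(tmp) / "backups")
        print(sorted(p.name for p in copy.iterdir()))
```

Restoring is the reverse: with BOINC stopped, copy the saved folder back over the live one.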
ID: 19684

KeeperC
Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 19688 - Posted: 27 Jan 2006, 14:24:16 UTC - in response to Message 19673.  
Last modified: 27 Jan 2006, 14:25:02 UTC


>> One of my machines has crashed out three times over the last 10 days or so. In each case it's -161. I won't have access to the machine again until the weekend, but I'll look in the yabsd file then.
>
> This is due to a batch of bad WUs sent out previously.
> It's been resolved. Any new WUs you get will be OK.

I don't think this is the case. If you look at the machine (325133), you will see that the most recent crash was on a model issued on 16th Jan, long after the batch of bad WUs was resolved.
ID: 19688

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 19703 - Posted: 27 Jan 2006, 20:15:09 UTC - in response to Message 19688.  

>> This is due to a batch of bad WUs sent out previously.
>> It's been resolved. Any new WUs you get will be OK.
>
> I don't think this is the case. If you look at the machine (325133), you will see that the most recent crash was on a model issued on 16th Jan, long after the batch of bad WUs was resolved.

I think Tolu meant another bad batch, but I could be wrong.
ID: 19703

ThePhantom86
Joined: 6 Aug 04
Posts: 42
Credit: 3,693,897
RAC: 3,475
Message 19847 - Posted: 1 Feb 2006, 8:03:59 UTC

Looks like I've been hit with these as well. This host.
ID: 19847

old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 19852 - Posted: 1 Feb 2006, 11:08:39 UTC

Me too ... I just killed two ... don't feel bad, I notice that they both failed for someone else too ...
ID: 19852

old_user3434
Joined: 30 Aug 04
Posts: 77
Credit: 1,785,934
RAC: 0
Message 19886 - Posted: 2 Feb 2006, 7:44:57 UTC - in response to Message 19852.  

Pretty easy for me: of 24 machines, not a single one so far has managed to process Sulphur correctly.

Effectively I've suspended CPDN until they fix the recent, enormous problems with their clients. There is nothing else to do (it would be a huge waste of resources otherwise) :(
Scientific Network : 44800 MHz - 77824 MB - 1970 GB
ID: 19886

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19887 - Posted: 2 Feb 2006, 8:30:31 UTC
Last modified: 2 Feb 2006, 8:49:33 UTC

Why has my post been deleted? Maybe it was too long? Or should I post it in the phpBB forum?
ID: 19887

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19888 - Posted: 2 Feb 2006, 8:50:33 UTC

Regretfully you mentioned a certain name associated with a certain, soon to be released, project. Mods & Admins have been notified that this name & project are not to be mentioned, due to legal requirements, so I had to delete your post. Lots of mine, from before I found out, have also been deleted.

Actually, they are not "deleted", just hidden, and in a week or two, when it is officially announced, I'll go through them, and 'return' them to view.

As for your offer to help with the testing, this is being done by people with known records of being able to complete models. There is no credit, just 'destructive testing' of all the options with a computer that is known to work with spinup, which is also a difficult testing process. After this, the user has the option to continue, to produce some starting data for "IT", or return to spinup.
It is hoped that mentions of "IT" on sites outside the control of this site do not attract undue attention.
And I hope that people reading these boards do not persist in posting about this matter.

Anyone whose computers can't complete a sulphur model can wait for the coupled model in a month or two, or concentrate on other projects for a while.

ID: 19888

old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 19889 - Posted: 2 Feb 2006, 9:42:56 UTC
Last modified: 2 Feb 2006, 9:44:09 UTC

Well, I have been doing pretty well I thought. But one computer now seems to have had several in a row. I am going to try to get one more to see ...

I had been completing and had completed several SLab models with that computer and the first of the three deaths seems to be the cross from phase one to two ... I think that was mentioned as an issue.

The last two were "819" traps on start up. Which I find interesting as I don't run the graphics and that error trap is USUALLY an indication of a video card issue ... well I will try one more I guess ...

==== edit

Forgot to mention that someone else also tried the last two models and also got an "819" error, though one DID run for a bit ... interesting ...

Really strange ...
ID: 19889

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19890 - Posted: 2 Feb 2006, 9:58:23 UTC

OK, I didn't know this :) So I'll copy, filter and paste the old message, which you can delete or keep hidden if you wish :)

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=1108852

Same workunit :)

As I said before, I let this workunit crash to gather some useful info for avoiding crashes :)

After one crash the yabsd.out file was still present. Its last part contained this:
FIXED LENGTH HEADER
-------------------
Dump format version-32768
UM Version No 401
Atmospheric data
On hybrid levels
Over global domain
Ancillary dataset
Exp No =-32768 Run Id =-32768
360-day calendar
Arakawa B grid
Year Month Day Hour Min Sec DayNo
Data time = 0 1 16 0 0 0 0
Validity time = 0 12 16 0 0 0 0
Creation time = 0 1 0 0 0 0 0
Start 1st dim 2nd dim 1st parm 2nd parm
Integer Consts 257 15 15
Real Consts 272 6 6
Level Dep Consts -32768 1 1 1 1
Row Dep Consts -32768 1 1 1 1
Column Dep Consts -32768 1 1 1 1
Fields of Consts -32768 1 1 1 1
Extra Consts -32768 1 1
History Block -32768 1 1
CFI No 1 -32768 1 1
CFI No 2 -32768 1 1
CFI No 3 -32768 1 1
Lookup Tables 278 64 912 64 912
Model Data 58881 6391296 6391296

LOOKUP TABLE
58368 64-bit words long
ANCILLARY_STEPSim(s_im) 5
INITMOS : MOS_OUTPUT_LENGTH = 1129
im,sm,ngroup,new_im,new_sm 1 1 48 T F
PPCTL: Opening preattached file on unit 60
PPCTL: Opening preattached file on unit 61
PPCTL: Opening preattached file on unit 62

PP_CTL: Error Buffering in Fixed length Header
Empty PP File in Climate Mode?

Error code = 0.00
Length requested = 0
Length actually transferred = 256
PPCTL: Opening preattached file on unit 63
PPCTL: Opening preattached file on unit 64
PPCTL: Opening preattached file on unit 65
PPCTL: Opening preattached file on unit 66
PPCTL: Opening preattached file on unit 67
PPCTL: Opening preattached file on unit 68

After the last crash there was only the stderr_um.txt file, containing this:

BUFFIN: C I/O Error - Return code = 16
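
A quick way to pull the interesting lines out of a long dump like the one above is to filter for the word "error". The Python sketch below is only an illustration: the function name `find_error_lines` is made up here, and the sample text is a handful of lines copied from the two files quoted in this post, not a complete log.

```python
import re

# A few lines copied from the yabsd.out and stderr_um.txt excerpts in this post.
# A real run would read the file contents instead of using this sample.
SAMPLE = """\
PPCTL: Opening preattached file on unit 62

PP_CTL: Error Buffering in Fixed length Header
Empty PP File in Climate Mode?

Error code = 0.00
Length requested = 0
Length actually transferred = 256
PPCTL: Opening preattached file on unit 63
BUFFIN: C I/O Error - Return code = 16
"""

# Match the whole word "error" in any letter case.
ERROR_LINE = re.compile(r"\berror\b", re.IGNORECASE)


def find_error_lines(text):
    """Return (line_number, stripped_line) pairs for lines that mention 'error'."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if ERROR_LINE.search(line)
    ]


if __name__ == "__main__":
    for number, line in find_error_lines(SAMPLE):
        print(f"{number:4d}: {line}")
```

Line numbers make it easier to go back to the full file and read the context around each hit.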

Naturally I back up everything, so the climate model continues to advance and, as you can see, my machine continues to trickle :)

I don't think it is a workunit problem, but an application problem that should be solved. How can you tell normal people that, before playing a game or doing something else with a heavy load, they must back up the BOINC folder or shut down BOINC?


Best Regards
Luigi
ID: 19890

old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 19891 - Posted: 2 Feb 2006, 11:59:17 UTC

It did not take long ... error and it failed for the other participant on start up ...

Are we sure we are done with the bad work units pending?

Well tomorrow is another day. I looked in my account, I thought I had done more sulfur, but so far have only successfully completed one. But, I do have another that is only 2 days from completion and it runs continuously, so, theory says that should be a good estimate (though I am not sure ... probably will take 3-4 days).

Regardless, I have one more coming, the next one after that has 16 days to run ...
ID: 19891

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19892 - Posted: 2 Feb 2006, 12:02:03 UTC

I'm a programmer, so I know that when reporting a possible bug it's better to give more details :)

My computer is an Athlon 64 3200+ (754-pin, 0.13 µm, 2 GHz)
Motherboard: Abit KV8 with the latest BIOS
1 GB of RAM (2 DDR400 modules)

Add-on boards:
DVB-S board: SkyStar 2
Pinnacle PCI-500 board
A standard Realtek Ethernet board
No sound card; I use the integrated one.

Standard clock; the memory timings are also taken from the SPD settings.

I have no power supply problems; I have an Enermax power supply (I don't remember the model :) )

No problem with the CPU overheating; I'm using a Hyper 6 from Cooler Master, 950 g of laminated copper :)

I have the latest stable drivers for everything, and the system is completely stable.

I'm running BOINC version 5.2.13 with a normal installation (no service) and automatic start.

OS: Windows XP Pro SP2 without any additional updates.

Reproducing the issue is simple:
1) You need a brother (maybe that's not strictly necessary :) )
2) Turn on the computer.
3) Wait until the Windows logon appears.
4) Log on.
5) Start a standard game, in this case Splinter Cell 1.
6) After 30 minutes or 1 hour, exit the game.
7) The model has crashed.

This weekend I'll try to reproduce the model crash myself to gather more specific details. I also want to see whether I can reproduce it on another computer.
ID: 19892

old_user2467
Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 19896 - Posted: 2 Feb 2006, 14:00:54 UTC

> 5) Start a standard game, in this case Splinter Cell 1.
> 6) After 30 minutes or 1 hour, exit the game.
> 7) The model has crashed.

I would recommend shutting down BOINC, or at least suspending computation, while gaming!
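
One way to script the "suspend while gaming" advice, as a sketch only: recent BOINC clients ship a command-line tool called `boinccmd`, whose `--set_run_mode` option accepts `always`, `auto` or `never` plus an optional duration in seconds. Whether the 5.2.x client discussed in this thread offers the same tool and option is an assumption that should be checked first. The helper names `run_mode_command` and `suspend_for_gaming` are made up for illustration.

```python
import shutil
import subprocess

VALID_MODES = ("always", "auto", "never")


def run_mode_command(mode, seconds=0, tool="boinccmd"):
    """Build the argument list that asks the BOINC client to change its run mode.

    seconds=0 leaves the mode in place until it is changed again; a positive
    value asks the client to revert to the previous mode after that many seconds.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    if seconds < 0:
        raise ValueError("seconds must not be negative")
    command = [tool, "--set_run_mode", mode]
    if seconds:
        command.append(str(seconds))
    return command


def suspend_for_gaming(seconds=3600):
    """Suspend computation for `seconds` if boinccmd is on the PATH (an assumption)."""
    command = run_mode_command("never", seconds)
    if shutil.which(command[0]) is None:
        print("boinccmd not found; would have run:", " ".join(command))
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Only print the command here; call suspend_for_gaming() to actually run it.
    print(" ".join(run_mode_command("never", 3600)))
```

Depending on the setup, `boinccmd` may need to be started from the BOINC data directory or be given the client's password before it is allowed to change anything.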
ID: 19896

old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 19904 - Posted: 2 Feb 2006, 16:41:31 UTC

Well, I wish mine were that simple. Stable, single use platform, BOINC only ... heck, it is so single use I usually only look at it through RealVNC as there is no need to go local ... :)
ID: 19904

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19906 - Posted: 2 Feb 2006, 17:23:43 UTC - in response to Message 19896.  

>> 5) Start a standard game, in this case Splinter Cell 1.
>> 6) After 30 minutes or 1 hour, exit the game.
>> 7) The model has crashed.
>
> I would recommend shutting down BOINC, or at least suspending computation, while gaming!

Yes, I know, at least for climateprediction. You, I and several thousand other people can do this, but there are millions out there who can't.

For the BOINC platform to be more attractive to normal users, it must be more reliable even while you are playing a game :)

The average computer user is able to surf the internet, write an email and install a program; he doesn't care about anything else. I look at the forums, the server status, the results page and so on several times a day; you could say that I'm a BOINC addict :)

An example of a hypothetical non-expert user:

1) A friend tells him that he can use his spare CPU time for something useful.
2) He thinks "why not?"
3) He installs the BOINC client (if he is able to).
4) He chooses the projects he likes (this is better now than before, but I'm waiting for account managers :) )
5) He is sure that it doesn't need his attention and forgets that BOINC exists for a month.
6) Because he plays his favourite game for 1 hour a day, in that month he has lost 30 climate models: lost time, wasted server resources, and no science done.
7) BOINC is deleted and the user is lost.

My first DC project was UD, and what I liked about it was that it was an install-and-forget program. Then, over a year ago, I switched to BOINC because I liked its philosophy.
For example, I keep the UD client on a friend's computer to which I have very infrequent access. I would like to install BOINC there as soon as I can, but for now managing a remote client with a dynamic IP is a **** ** *** ***

P.S.
I'm sorry for my bad English :(
ID: 19906

old_user2467
Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 19911 - Posted: 2 Feb 2006, 20:24:41 UTC

> but for now to manage a remote client with dynamic ip it's a **** ** *** ***

Heard about dynamic DNS services like http://dyn.dns.org?
ID: 19911

old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 19913 - Posted: 2 Feb 2006, 21:48:07 UTC

Perhaps for your friend a different project might be more appropriate. I know this sounds like heresy ... but not all projects are suitable for all computers and all people.

I have had decent luck running CPDN on all my PCs: occasional model crashes for various reasons, but a pretty decent track record. Heck, I am about to complete my second Sulfur model in a couple of days (1 day 12 hours).

But, though you would think that it would be a better computer to run CPDN, I have yet to complete a model on my PowerMac G5 ... bad computer? Bad program, gremlins? Who knows. But I just stopped and now run other projects on the PowerMac; it really shines at Einstein@Home ...

Again, this is the beauty of BOINC ...

Oh, and WCG uses the UD program if you like, or you can run their two projects under BOINC like I do ...
ID: 19913

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 19917 - Posted: 2 Feb 2006, 23:04:39 UTC

I occasionally run Doom while BOINC is running. The only problem is when it starts to benchmark. Then Doom slows right down and movement gets jerky. At least for me. The baddies seem to keep going. :(
When/if I wake up to it, I suspend Doom until the benchmark is finished.


ID: 19917

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19935 - Posted: 3 Feb 2006, 8:12:09 UTC - in response to Message 19913.  

> Perhaps for your friend a different project might be more appropriate. I know this sounds like heresy ... but not all projects are suitable for all computers and all people.
>
> I have had decent luck running CPDN on all my PCs: occasional model crashes for various reasons, but a pretty decent track record. Heck, I am about to complete my second Sulfur model in a couple of days (1 day 12 hours).
>
> But, though you would think that it would be a better computer to run CPDN, I have yet to complete a model on my PowerMac G5 ... bad computer? Bad program, gremlins? Who knows. But I just stopped and now run other projects on the PowerMac; it really shines at Einstein@Home ...
>
> Again, this is the beauty of BOINC ...
>
> Oh, and WCG uses the UD program if you like, or you can run their two projects under BOINC like I do ...

For now I run climateprediction only on my home computer, where there is no internet access; it's the best project for a computer like this. I only have to back up the BOINC folder every day and, once in a while, carry it to work on a CD-RW.

On my friend's computer I think I'll install BOINC with WCG and Einstein; I certainly won't install climateprediction. I like this project, but it needs too much user attention.

BTW, on my home computer I managed to complete an old slab model, without backups and with a lot of gaming ;) http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=250218
ID: 19935

old_user19523
Joined: 20 Sep 04
Posts: 14
Credit: 30,765
RAC: 0
Message 19936 - Posted: 3 Feb 2006, 9:39:12 UTC - in response to Message 19917.  

> I occasionally run Doom while BOINC is running. The only problem is when it starts to benchmark. Then Doom slows right down and movement gets jerky. At least for me. The baddies seem to keep going. :(
> When/if I wake up to it, I suspend Doom until the benchmark is finished.

Not every game eats 100% of the CPU. For example, when I play PES 5 the model continues to advance, because the game doesn't need much CPU time :)

Many games simply do this:

while (1) {
    continue; // :)
}
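
To illustrate the point about programs that poll in a tight loop, here is a small Python sketch (not taken from any game) that measures how much CPU time a busy loop burns compared with a loop that sleeps briefly between polls. The function names, the 5 ms sleep and the 0.2 s measurement window are arbitrary choices for this demonstration.

```python
import time


def cpu_seconds_used(step, wall_seconds=0.2):
    """Call step() repeatedly for wall_seconds and return the CPU time consumed."""
    start_cpu = time.process_time()
    deadline = time.monotonic() + wall_seconds
    while time.monotonic() < deadline:
        step()
    return time.process_time() - start_cpu


def spin():
    """Busy-wait: return at once, so the caller polls again immediately."""
    pass


def nap():
    """Give the CPU back to the system for a few milliseconds between polls."""
    time.sleep(0.005)


if __name__ == "__main__":
    print(f"busy loop : {cpu_seconds_used(spin):.3f} s of CPU")
    print(f"sleep loop: {cpu_seconds_used(nap):.3f} s of CPU")
```

As far as I know, BOINC science applications normally run at the lowest priority, so a foreground program that behaves like the first loop would leave them very little CPU time, while one that behaves like the second would let the model keep advancing.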
ID: 19936