climateprediction.net (CPDN) home page
Thread 'Is there a problem with HADAM3P Jobs on some AMD Processors?'




old_user587059

Joined: 3 Sep 09
Posts: 5
Credit: 509,410
RAC: 0
Message 43342 - Posted: 31 Oct 2011, 22:21:07 UTC

Two of my WinXP machines running AMD 64 3500+ processors are generating lots of errors with these jobs: 10 of the last 12 jobs ended in error on one (mostly with 0 credits) and 4 out of the last 6 on the other (all with 0 credit). One machine using a newer AMD 64 processor with Windows Vista and one using an Intel processor with WinXP aren't having problems. Are there known problems associated with older AMD 64 processors? Or have I just hit a stream of bad luck? (Lots of CPU cycles wasted in any event...)
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43343 - Posted: 31 Oct 2011, 22:41:25 UTC - in response to Message 43342.  

The problem is more likely to be that those 2 computers, (and your Intels), are light on memory. You really need a minimum of 2 gigs to ensure that all of the processes have enough to co-exist without having to swap out to the hd all of the time.

And, judging by the number of BOINC suspends, you have them set for the default of Suspend work if CPU usage is above, so that the models are constantly being stopped and started. You can get away with this for a while, but sooner or later, this stopping/starting is going to coincide with a critical point in the program and it's going to pack it in. They are supercomputer programs after all, and not designed for this behaviour.


old_user587059

Joined: 3 Sep 09
Posts: 5
Credit: 509,410
RAC: 0
Message 43346 - Posted: 1 Nov 2011, 4:17:41 UTC - in response to Message 43343.  

> The problem is more likely to be that those 2 computers, (and your Intels), are light on memory. You really need a minimum of 2 gigs to ensure that all of the processes have enough to co-exist without having to swap out to the hd all of the time.

The two machines in question have single processor/single core CPUs and 1GB of memory. They are dedicated BOINC machines and have no other processes running except when I login to see what is going on. The task manager always indicates at least 200kB of free memory -- usually more.

> And, judging by the number of BOINC suspends, you have them set for the default of Suspend work if CPU usage is above, so that the models are constantly being stopped and started.

I don't understand where you are getting that information -- both machines have the "suspend if work is above..." set to 0 (which means no restriction, as I understand it...). The machine info on this web site indicates that one is running BOINC 99.8046% of the time and the other 99.7719%.

> You can get away with this for a while, but sooner or later, this stopping/starting is going to coincide with a critical point in the program and it's going to pack it in. They are supercomputer programs after all, and not designed for this behaviour.

Maybe so, but this stuff is being farmed out to volunteers running PCs, NOT supercomputers...

old_user587059

Joined: 3 Sep 09
Posts: 5
Credit: 509,410
RAC: 0
Message 43347 - Posted: 1 Nov 2011, 4:25:06 UTC - in response to Message 43343.  

> You really need a minimum of 2 gigs to ensure that all of the processes have enough to co-exist without having to swap out to the hd all of the time.

If that's true, then the Technical Requirements page should be updated: it indicates that the minimum memory for WinXP systems is 256MB (!) and does not even mention Windows Vista or Windows 7!
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43348 - Posted: 1 Nov 2011, 4:49:18 UTC - in response to Message 43346.  

> The two machines in question have single processor/single core CPUs and 1GB of memory.

That's not what is shown here.

> They are dedicated BOINC machines and have no other processes running

Except OS related, antivirus, etc.
And when you look at what the models are doing, that uses RAM too, which suddenly decreases what's available for the modelling.
Do they have a separate graphics card with its own display memory, or is it an on-board chip using some of the main memory?

> I don't understand where you are getting that information

I'm getting it from the error messages of your models. Here for instance. Click on the + alongside stderr.

> Maybe so, but this stuff is being farmed out to volunteers running PCs, NOT supercomputers...

And most people have no problems in running the models on PCs. :)

But whatever is wrong, it's NOT an AMD 'thing'.

The Technical Requirements page was written way back when the only models available were the original (slab) models.
I'll add 'an update needed' to the work load of the project people.


geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 43349 - Posted: 1 Nov 2011, 13:32:11 UTC

You say that no other processes are running on those PCs while CPDN is. Do they have anti-spyware/anti-virus checkers running? If so, which software is it? A bunch of the failed results have error messages related to a file being open while being written to.
old_user587059

Joined: 3 Sep 09
Posts: 5
Credit: 509,410
RAC: 0
Message 43355 - Posted: 1 Nov 2011, 21:39:15 UTC - in response to Message 43348.  

>> Maybe so, but this stuff is being farmed out to volunteers running PCs, NOT supercomputers...
> And most people have no problems in running the models on pcs. :)
>
> But whatever is wrong, it's NOT an AMD 'thing'.
>
> The Technical Requirements page was written way back when the only models available were the original (slab) models.
> I'll add 'an update needed' to the work load of the project people.

And I am only having problems on two of the six machines (under two different user IDs, in case you check and tell me that I am wrong...) attached to this project. So, if they keep erroring out, I will simply detach them and let them work on other projects that they can handle OK. Case closed.
