climateprediction.net (CPDN) home page
Thread 'Computation errors on a large number of Weather at Homes.'

Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63257 - Posted: 4 Jan 2021, 12:13:34 UTC
Last modified: 4 Jan 2021, 12:14:09 UTC

In the last batch a couple of months ago, I had a computation error on maybe 20% of the tasks; this time it looks like 50%. It can happen anywhere in the task: some failed early on at around 15%, some right close to the end at 85%. Is that happening for everyone? Is anyone here able to analyse the returns from my machines to see what, if anything, I'm doing wrong? I don't get errors on any other project. I have 7 very different computers, so I'm not blaming the hardware. They don't mind being paused when BOINC does other tasks, do they? Or when I play a game?
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63258 - Posted: 4 Jan 2021, 12:47:33 UTC - in response to Message 63257.  

Nothing new. Read the other threads. You have been lucky; 100% of mine errored out.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63259 - Posted: 4 Jan 2021, 13:04:06 UTC - in response to Message 63257.  

They don't mind being paused when Boinc does other tasks do they? Or when I play a game?

Yes, they do mind.
We tell people about this several times a month.
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63260 - Posted: 4 Jan 2021, 13:56:27 UTC - in response to Message 63259.  
Last modified: 4 Jan 2021, 14:13:04 UTC

They don't mind being paused when Boinc does other tasks do they? Or when I play a game?

Yes, they do mind.
We tell people about this several times a month.
I've never been told; your communications aren't too good.

Why do they mind when other projects don't? And why do they not normally mind? They usually have to be paused several times before a crash.

To sort this, is there any way I can tell Boinc to not pause CPDN tasks? I can certainly set the "switch between applications" to a very large number (it's currently 60 minutes which I think is the default), but that would cause projects to have tasks returned late due to a Primegrid task for example occupying all 24 cores for 2 weeks. Those don't mind being paused so I want to still allow that.

I've got CPDN on weight 1000000, the others are between 1 and 50, so it won't pause one to do something else unless the something else is a multi core task.

I have "leave applications in memory" ticked. I'll cease downloading them for the games machine. The other 6 run Boinc 24/7.
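For anyone who prefers a file to the Manager dialogs, the two settings mentioned above ("switch between applications" and "leave applications in memory") can be sketched as a local preferences override. This is only a sketch: it assumes the client reads a `global_prefs_override.xml` file in its data directory and that the tags are `cpu_scheduling_period_minutes` and `leave_apps_in_memory`, so check the BOINC documentation for your client version before relying on it. The function name and the value 100000 are illustrative, not from this thread.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(switch_minutes: float, leave_in_memory: bool) -> str:
    """Return the text of a minimal preferences override (assumed tag names)."""
    root = ET.Element("global_preferences")
    # "Switch between applications every N minutes" (assumed tag name).
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = f"{switch_minutes:.6f}"
    # "Leave applications in memory while suspended" (assumed tag name).
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML rather than writing it, so nothing is overwritten by accident.
    print(build_prefs_override(100000, True))
```

The output would be saved by hand into the BOINC data directory and the client told to re-read its local preferences. As noted in this post, a long switch interval does not by itself stop a multicore task from displacing CPDN tasks.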

The only thing I can think of doing is manually editing the deadline for CPDN tasks as they come in. If I set it to make Boinc think they're going to be late, it'll run them no matter what, even at the same time as a 24 core task from another project.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63261 - Posted: 4 Jan 2021, 14:13:35 UTC - in response to Message 63260.  

Nothing wrong with my communications, it's just you.
Read through the board to see what's been said about lots of things.

For Suspending tasks, there's this thread: Suspending WUs safely

And more in other places.
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63262 - Posted: 4 Jan 2021, 14:16:48 UTC - in response to Message 63260.  
Last modified: 4 Jan 2021, 14:19:15 UTC

To sort this, is there any way I can tell Boinc to not pause CPDN tasks? I can certainly set the "switch between applications" to a very large number (it's currently 60 minutes which I think is the default), but that would cause projects to have tasks returned late due to a Primegrid task for example occupying all 24 cores for 2 weeks. Those don't mind being paused so I want to still allow that.
In theory (right), BOINC will run the other projects in "high priority" mode if necessary to get them in on time. But I don't use Primegrid, so I don't know how it handles that.
I don't normally have a problem with just setting "switch between" to a large number.

Or else, can't you just limit Primegrid with an app_config.xml to run on only 20 cores or so to leave a few free for CPDN?
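A minimal sketch of what such an app_config.xml might contain, generated from Python so the numbers are easy to change. The app name `example_app` is a placeholder (the real name has to come from the project), and the `-t` thread-count option is an assumption about how that project's multithreaded app takes its thread count; `avg_ncpus` tells BOINC how many cores to budget for each task.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, threads: int) -> str:
    """Return app_config.xml text capping one multithreaded app at `threads` cores."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name       # placeholder name
    ET.SubElement(version, "cmdline").text = f"-t {threads}"  # assumed app option
    ET.SubElement(version, "avg_ncpus").text = str(threads)   # cores BOINC budgets
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 20 of 24 cores, as suggested above.
    print(build_app_config("example_app", 20))
```

The file would go in that project's folder under the BOINC data directory, followed by a re-read of the config files from the Manager.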
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63263 - Posted: 4 Jan 2021, 15:19:31 UTC - in response to Message 63261.  
Last modified: 4 Jan 2021, 16:05:20 UTC

Nothing wrong with my communications, it's just you.
I've read a fair bit of the forums, but never seen anything about this. I can't be expected to read everything in here! Something this important should be mentioned when you sign up, just like LHC warns you that you need to get VirtualBox and switch on VT-x (hardware virtualisation) in the BIOS. If you don't, their tasks error out in seconds, although you'd soon notice that.

For Suspending tasks, there's this thread: Suspending WUs safely

And more in other places.
You said in that thread that it's ok to suspend if they're left in memory. I have all my computers set to leave them in memory. So why are they crashing? One of the machines has a game as an exclusive application, but I just checked and Boinc honours my request to leave them in memory. The memory is not freed up looking in the task manager. The other 6 machines do not pause for anything other than different projects within Boinc. Those still crash though. I also notice they're not crashing at the point of resuming. For example the games machine crashed one overnight, not when I stopped playing the game yesterday evening. And then it crashed another during this morning, after having run continuously for about 12 hours.

All I can do is try to prevent suspending (even though it should be OK). I'll remove "exclusive applications" from BOINC on the games machine and manually suspend all projects except CPDN when I play (if CPDN is using fewer than all 24 cores I doubt it'll impact the performance much, as the game only needs 2 or 3). I'll also set any CPDN tasks that come in to have a very short deadline, to discourage other stuff from interfering. I'll do this only once CPDN is using fewer than all the cores, or if there's nothing else in the buffer, so that it doesn't stop the other projects from completing tasks in the buffer.
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63264 - Posted: 4 Jan 2021, 15:36:10 UTC - in response to Message 63262.  
Last modified: 4 Jan 2021, 15:40:44 UTC

To sort this, is there any way I can tell Boinc to not pause CPDN tasks? I can certainly set the "switch between applications" to a very large number (it's currently 60 minutes which I think is the default), but that would cause projects to have tasks returned late due to a Primegrid task for example occupying all 24 cores for 2 weeks. Those don't mind being paused so I want to still allow that.
In theory (right), BOINC will run the other projects in "high priority" mode if necessary to get them in on time. But I don't use Primegrid, so I don't know how it handles that.
I don't normally have a problem with just setting "switch between" to a large number.
The problem with that is CPDN will still get paused. Consider a 24 core machine with 16 CPDN tasks running. Once there's nothing left in the buffer to occupy those other 8 cores, it will try to download something. If that happens to be Primegrid, Milkyway, or LHC, those are multicore tasks. They will shove a CPDN out of the way, even if "switch between applications" is infinite.

Or else, can't you just limit Primegrid with an app_config.xml to run on only 20 cores or so to leave a few free for CPDN?
Good suggestion, same as Richard Haselgrove mentioned over in the main Boinc forums. But it's not flexible. Say I limited it to 20 cores, what if next time I get CPDN tasks there are 8 of them? 4 will still get shoved out of the way every time a Primegrid task wants to run.
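The arithmetic behind that objection can be sketched as follows. It assumes each CPDN task uses exactly one core and that the multicore task always gets its full allotment; the function name is made up for illustration, and the real BOINC scheduler is more complicated than this.

```python
def displaced_cpdn_tasks(total_cores: int, cpdn_tasks: int, multicore_limit: int) -> int:
    """Number of single-core CPDN tasks pushed aside when a multicore task
    limited to `multicore_limit` cores starts on a `total_cores` machine."""
    cores_wanted = cpdn_tasks + multicore_limit
    return max(0, cores_wanted - total_cores)


if __name__ == "__main__":
    # The case above: 24 cores, Primegrid capped at 20, 8 CPDN tasks arrive.
    print(displaced_cpdn_tasks(24, 8, 20))
```

With 4 or fewer CPDN tasks the cap of 20 is enough; any more and the excess gets paused, which is why a fixed cap is inflexible.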
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 63265 - Posted: 4 Jan 2021, 16:24:21 UTC

From Sarah with regards to these tasks. (Batch 890)

Hi all,

Sorry for the delayed response over Christmas. Just to let you know that I have stopped resends going out from this batch as the failures are very high.

The experiment setup, files, etc. are the same as we have used for other weather@home regions, so I wasn't expecting this degree of error. I know, though, that the South Africa region is not as stable as other regions, and I am wondering if this is what is causing the problems in this case. It would be consistent with just an initial conditions difference causing an error, as Ian reported.

I will check that I picked up the best restart file to use and also what the researchers want to do.

Best wishes and Happy New Year,

Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63266 - Posted: 4 Jan 2021, 22:33:39 UTC - in response to Message 63265.  

The South Africa region is the most unstable as far as climate goes. The currents from the Indian Ocean collide with the currents of the South Atlantic. To add further spice, the cold Antarctic currents collide from below. Different sea temperatures. It is quite exciting navigating around both the Capes: the Cape of Good Hope in this case, and Cape Horn at the southern tip of the South American continent.
There are other spots too, like the coast off Chile, famous for the El Niño effect. Then we have the Bay of Biscay, where the Gulf Stream excites matters.
I had been wondering the same thing, the currents are an unknown variable. Then you might have heard of The Roaring Forties to add further spice. https://wiki2.org/en/Roaring_Forties
No need to kick the computers. As an aside, some of us have left the other projects and dedicated ourselves to CPDN. Running other projects with the long deadlines of CPDN does lead to clashes.
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,363,137
RAC: 15,665
Message 63267 - Posted: 4 Jan 2021, 23:33:49 UTC - in response to Message 63266.  

Of the four that I got, one completed, one errored out after 6 zips, one after 7 and the last after 20! No discernible pattern.
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63272 - Posted: 5 Jan 2021, 18:44:43 UTC - in response to Message 63267.  

Of the four that I got, one completed, one errored out after 6 zips, one after 7 and the last after 20! No discernible pattern.
I got just over 50. One has completed successfully, 14 are still running on the slower machines, and the rest crashed, mostly around two-thirds of the way through (that two-thirds might mean something?)
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63273 - Posted: 5 Jan 2021, 18:48:58 UTC - in response to Message 63266.  
Last modified: 5 Jan 2021, 18:49:50 UTC

The South Africa region is the most unstable as far as climate goes. The currents from the Indian Ocean collide with the currents of the South Atlantic. To add further spice, the cold Antarctic currents collide from below. Different sea temperatures. It is quite exciting navigating around both the Capes: the Cape of Good Hope in this case, and Cape Horn at the southern tip of the South American continent.
There are other spots too, like the coast off Chile, famous for the El Niño effect. Then we have the Bay of Biscay, where the Gulf Stream excites matters.
I had been wondering the same thing, the currents are an unknown variable. Then you might have heard of The Roaring Forties to add further spice. https://wiki2.org/en/Roaring_Forties
No need to kick the computers. As an aside, some of us have left the other projects and dedicated ourselves to CPDN. Running other projects with the long deadlines of CPDN does lead to clashes.
I have more than one interest, so I run many projects; also, CPDN can't keep my Windows machines very busy. When CPDN (or one of the other rare projects I run) produces something, I get it to run with priority. Sometimes BOINC does need quite a shove to do what I ask. For example, as I mentioned earlier in this thread, even with CPDN at 1000 times more weighting than other projects, a 24 core task will push CPDN out of the way, hence what I'm now going to do. I yearn for the time when programs just worked; nowadays we buy half finished products (I'm having a go at BOINC, not the CPDN program).
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 63375 - Posted: 22 Jan 2021, 11:22:51 UTC

Just one hard fail across the three recent EU batches. Still not enough to work out what is going on.
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,363,137
RAC: 15,665
Message 63382 - Posted: 22 Jan 2021, 23:48:45 UTC - in response to Message 63375.  

Not sure if it's significant, but two of my current batch on Windows are repeats that failed on the same computer, which has a Ryzen chip (seg fault after 1 zip). One is running OK on my i5 and has got to its 3rd zip. The other has yet to start.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63383 - Posted: 23 Jan 2021, 0:44:51 UTC - in response to Message 63382.  

That Ryzen has 65 running on a 12 core machine!
And has crashed 7 more from the latest batches. :(
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 63386 - Posted: 23 Jan 2021, 7:37:59 UTC

Still only one hard fail between the three most recent Windows batches and in that case, all three were on Intel processors. I am guessing at least another day or two before we get even a hint of statistical significance. Anyone know what percentage of machines are Intel compared with AMD? I have noticed before when poking around, there are also quite a few ARM powered machines attached to this project which will never get any work because they are not supported. (I am assuming this is also true for the OpenIFS tasks that have been around on and off in testing for a while.)
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63387 - Posted: 23 Jan 2021, 15:14:45 UTC - in response to Message 63386.  

Still only one hard fail between the three most recent Windows batches and in that case, all three were on Intel processors. I am guessing at least another day or two before we get even a hint of statistical significance. Anyone know what percentage of machines are Intel compared with AMD? I have noticed before when poking around, there are also quite a few ARM powered machines attached to this project which will never get any work because they are not supported. (I am assuming this is also true for the OpenIFS tasks that have been around on and off in testing for a while.)

______________
Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner. I just completed one successfully but I have not rebooted it for the last six days and got another WU on completion. I have marked it no further tasks so that I can reboot it and empty out my RAM from whatever things are left behind by programmes. See if you can find that data on WUProp@home.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 63388 - Posted: 23 Jan 2021, 16:12:02 UTC - in response to Message 63387.  

Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner. I just completed one successfully but I have not rebooted it for the last six days and got another WU on completion. I have marked it no further tasks so that I can reboot it and empty out my RAM from whatever things are left behind by programmes. See if you can find that data on WUProp@home.


The one hard fail has been on two i5s and one i7 but I would suggest we need at least ten hard fails to see if there is any pattern.
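To put a rough number on why one hard fail says so little: if a fraction `intel_share` of attached machines were Intel, the chance that all three machines in a single hard fail are Intel purely by coincidence is `intel_share ** 3`. The share is not known in this thread, so the 0.7 below is purely an illustrative assumption, as is the simplification that tasks land on machines at random.

```python
def chance_all_same_vendor(vendor_share: float, machines: int = 3) -> float:
    """Probability that `machines` randomly assigned hosts all come from a
    vendor holding `vendor_share` of the pool (independence assumed)."""
    if not 0.0 <= vendor_share <= 1.0:
        raise ValueError("vendor_share must be between 0 and 1")
    return vendor_share ** machines


if __name__ == "__main__":
    intel_share = 0.7  # assumed for illustration only, not a measured figure
    print(f"one hard fail (3 machines): {chance_all_same_vendor(intel_share, 3):.3f}")
    print(f"ten hard fails (30 machines): {chance_all_same_vendor(intel_share, 30):.6f}")
```

Under that assumed share, an all-Intel hard fail would turn up by chance roughly a third of the time, so it is no evidence on its own. Ten all-Intel hard fails in a row would be far harder to explain as coincidence.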
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63389 - Posted: 23 Jan 2021, 17:22:42 UTC - in response to Message 63387.  

Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner.


I got a new machine just to run Windows 10 so I could run the TaxACT program for my taxes.

GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]

Since the machine is there anyway, I put BOINC on it and started it running ClimatePrediction, WCG, Rosetta, and Universe. It has completed only one CPDN task so far.

Name                    wah2_sam50_a0dd_201312_25_880_012033489_1
Workunit                12033489
Created                 26 Dec 2020, 7:27:13 UTC
Sent                    26 Dec 2020, 7:27:20 UTC
Report deadline         8 Dec 2021, 12:47:20 UTC
Received                1 Jan 2021, 7:35:21 UTC
Server state            Over
Outcome                 Success
Client state            Done
Exit status             0 (0x00000000)
Computer ID             1512658
Run time                5 days 23 hours 0 min 51 sec
CPU time                5 days 21 hours 48 min 23 sec
Validate state          Valid
Credit                  19,018.14
Device peak FLOPS       4.00 GFLOPS
Application version     Weather At Home 2 (wah2) v8.24 windows_intelx86
Peak working set size   222.23 MB
Peak swap size          188.70 MB
Peak disk usage         75.02 MB
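As a quick check on the record above, the run time and CPU time can be compared to see what share of the task's scheduled time was actually spent computing. This is a small sketch using only the two durations listed; the function name is made up for illustration.

```python
from datetime import timedelta

# Durations copied from the task record above.
run_time = timedelta(days=5, hours=23, minutes=0, seconds=51)
cpu_time = timedelta(days=5, hours=21, minutes=48, seconds=23)


def cpu_fraction(cpu: timedelta, run: timedelta) -> float:
    """Share of the reported run time that was spent on the CPU."""
    return cpu / run  # dividing two timedeltas gives a float


if __name__ == "__main__":
    print(f"CPU time is {cpu_fraction(cpu_time, run_time):.1%} of run time")
```

The two figures differ by a little over an hour across nearly six days, so the task had a core to itself almost the whole time it was running.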
