Computation errors on a large number of Weather at Homes.
Author | Message |
---|---|
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
With the last batch a couple of months ago, I had a computation error on maybe 20% of them; this time it looks like 50%. And it can happen anywhere in the task: some did it early on at around 15%, some right close to the end at 85%. Are they doing that for everyone? Is anyone here able to analyse the returns from my machines to see what I'm doing wrong, if anything? I don't get errors on any other project. I have 7 very different computers, so I'm not blaming the hardware. They don't mind being paused when BOINC does other tasks, do they? Or when I play a game? |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Nothing new. Read the other threads. You have been lucky; I errored out on 100% of them. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"They don't mind being paused when BOINC does other tasks, do they? Or when I play a game?" Yes, they do mind. We tell people about this several times a month. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I've never been told; your communications aren't too good. "They don't mind being paused when BOINC does other tasks, do they? Or when I play a game?" Why do they mind when other projects don't? And why do they not normally mind? They usually have to be paused several times before a crash. To sort this, is there any way I can tell BOINC not to pause CPDN tasks? I can certainly set "switch between applications" to a very large number (it's currently 60 minutes, which I think is the default), but that would cause projects to have tasks returned late due to, for example, a PrimeGrid task occupying all 24 cores for 2 weeks. Those don't mind being paused, so I still want to allow that. I've got CPDN on weight 1000000 and the others are between 1 and 50, so it won't pause one to do something else unless the something else is a multicore task. I have "leave applications in memory" ticked. I'll cease downloading them for the games machine. The other 6 run BOINC 24/7. The only thing I can think of doing is manually editing the deadline for CPDN tasks as they come in. If I set it to make BOINC think they're going to be late, it'll run them no matter what, even at the same time as a 24-core task from another project. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Nothing wrong with my communications, it's just you. Read through the board to see what's been said about lots of things. For suspending tasks, there's this thread: "Suspending WUs safely". And there's more in other places. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"To sort this, is there any way I can tell BOINC not to pause CPDN tasks? I can certainly set 'switch between applications' to a very large number (it's currently 60 minutes, which I think is the default), but that would cause projects to have tasks returned late due to, for example, a PrimeGrid task occupying all 24 cores for 2 weeks. Those don't mind being paused, so I still want to allow that." In theory (right), BOINC will run the other projects in "high priority" mode if necessary to get them in on time. But I don't use PrimeGrid, so I don't know how it handles that. I don't normally have a problem with just setting "switch between" to a large number. Or else, can't you just limit PrimeGrid with an app_config.xml to run on only 20 cores or so, to leave a few free for CPDN? |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Nothing wrong with my communications, it's just you." I've read a fair bit of the forums, but never seen anything about this. I can't be expected to read everything in here! Something this important should be mentioned when you sign up, just like LHC warns you that you need to get VirtualBox and switch on VT-x in the BIOS. If you don't, their tasks error out in seconds, although you'd soon notice that. "For suspending tasks, there's this thread: Suspending WUs safely" You said in that thread that it's OK to suspend if they're left in memory. I have all my computers set to leave them in memory. So why are they crashing? One of the machines has a game as an exclusive application, but I just checked and BOINC honours my request to leave them in memory. Looking in the task manager, the memory is not freed up. The other 6 machines do not pause for anything other than different projects within BOINC. Those still crash though. I also notice they're not crashing at the point of resuming. For example, the games machine crashed one overnight, not when I stopped playing the game yesterday evening. And then it crashed another during this morning, after having run continuously for about 12 hours. All I can do is try to prevent suspending (even though this should be OK). I'll remove "exclusive applications" from BOINC on the games machine and manually suspend all projects except CPDN when I play (if it's fewer than all 24 cores I doubt it'll impact the performance much; the game only needs 2 or 3). I'll also set any CPDN tasks that come in to have a very short deadline, to discourage other stuff from interfering. I'll do this only once fewer than all the cores are in use for CPDN, or if there's nothing else in the buffer, so that it doesn't stop the other projects from completing tasks in the buffer. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
The problem with that is CPDN will still get paused. Consider a 24-core machine with 16 CPDN tasks running. Once there's nothing left in the buffer to occupy those other 8 cores, it will try to download something. If that happens to be PrimeGrid, Milkyway, or LHC, those are multicore tasks. They will shove a CPDN task out of the way, even if "switch between applications" is infinite. "To sort this, is there any way I can tell BOINC not to pause CPDN tasks? I can certainly set 'switch between applications' to a very large number (it's currently 60 minutes, which I think is the default), but that would cause projects to have tasks returned late due to, for example, a PrimeGrid task occupying all 24 cores for 2 weeks. Those don't mind being paused, so I still want to allow that." "In theory (right), BOINC will run the other projects in 'high priority' mode if necessary to get them in on time. But I don't use PrimeGrid, so I don't know how it handles that." "Or else, can't you just limit PrimeGrid with an app_config.xml to run on only 20 cores or so, to leave a few free for CPDN?" Good suggestion, the same as Richard Haselgrove mentioned over in the main BOINC forums. But it's not flexible. Say I limited it to 20 cores: what if next time I get CPDN tasks there are 8 of them? 4 will still get shoved out of the way every time a PrimeGrid task wants to run. |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
From Sarah with regards to these tasks. (Batch 890) Hi all, Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
The South Africa region is the most unstable as far as climate goes. The currents from the Indian Ocean collide with the currents of the South Atlantic. To add further spice, the cold Antarctic currents collide from below, with different sea temperatures. It is quite exciting navigating around both the Capes: the Cape of Good Hope in this case, and Cape Horn, the South American continent's southern tip. There are other spots too, like the coast off Chile, famous for the El Niño effect. Then we have the Bay of Biscay, where the Gulf Stream excites matters. I had been wondering the same thing; the currents are an unknown variable. Then you might have heard of the Roaring Forties, to add further spice. https://wiki2.org/en/Roaring_Forties No need to kick the computers. As an aside, some of us have left the other projects and dedicated ourselves to CPDN. Running other projects alongside the long deadlines of CPDN does lead to clashes. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,985,010 RAC: 14,261 |
Of the four that I got, one completed, one errored out after 6 zips, one after 7 and the last after 20! No discernible pattern. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Of the four that I got, one completed, one errored out after 6 zips, one after 7 and the last after 20! No discernible pattern." I got just over 50. 1 has completed successfully, 14 are still running on the slower machines, and the rest crashed, mostly around two-thirds of the way through (that two-thirds might mean something?) |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"The South Africa region is the most unstable as far as climate goes. The currents from the Indian Ocean collide with the currents of the South Atlantic. To add further spice, the cold Antarctic currents collide from below, with different sea temperatures. It is quite exciting navigating around both the Capes: the Cape of Good Hope in this case, and Cape Horn, the South American continent's southern tip." I have more than one interest, so I run many projects; also, CPDN can't keep my Windows machines very busy. When CPDN (or one of the other rare projects I run) produces something, I get them to run with priority. Sometimes BOINC does need quite a shove to do what I ask, for example what I'm now going to do, as I mentioned earlier in this thread: even with CPDN at 1000 times more weighting than other projects, a 24-core task will push CPDN out of the way. I yearn for the time when programs just worked; nowadays we buy half-finished products (I'm having a go at BOINC, not the CPDN program). |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
Just one hard fail between the three recent EU batches. Still not enough to work out what is going on. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,985,010 RAC: 14,261 |
Not sure if it's significant, but two of my current batch on Windows are repeats that failed on the same computer, which has a Ryzen chip (seg fault after 1 zip). One is running OK on my i5 and has got to the 3rd zip. The other is yet to start. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
That Ryzen has 65 running on a 12-core machine! And it has crashed 7 more from the latest batches. :( |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
Still only one hard fail between the three most recent Windows batches and in that case, all three were on Intel processors. I am guessing at least another day or two before we get even a hint of statistical significance. Anyone know what percentage of machines are Intel compared with AMD? I have noticed before when poking around, there are also quite a few ARM powered machines attached to this project which will never get any work because they are not supported. (I am assuming this is also true for the OpenIFS tasks that have been around on and off in testing for a while.) |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
"Still only one hard fail between the three most recent Windows batches and in that case, all three were on Intel processors. I am guessing at least another day or two before we get even a hint of statistical significance. Anyone know what percentage of machines are Intel compared with AMD? I have noticed before when poking around, there are also quite a few ARM powered machines attached to this project which will never get any work because they are not supported. (I am assuming this is also true for the OpenIFS tasks that have been around on and off in testing for a while.)" Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner. I just completed one successfully, but I have not rebooted it for the last six days and it got another WU on completion. I have marked it no further tasks so that I can reboot it and empty out my RAM from whatever things are left behind by programmes. See if you can find that data on WUProp@home. |
Joined: 15 May 09 Posts: 4540 Credit: 19,013,957 RAC: 21,195 |
"Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner. I just completed one successfully, but I have not rebooted it for the last six days and it got another WU on completion. I have marked it no further tasks so that I can reboot it and empty out my RAM from whatever things are left behind by programmes. See if you can find that data on WUProp@home." The one hard fail has been on two i5s and one i7, but I would suggest we need at least ten hard fails to see if there is any pattern. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Which generation of Intel processors? On my 8th generation, they are not failing. On my 10th generation, they were failing and that has been sent into the naughty corner." I got a new machine just to run Windows 10 so I could run the TaxACT program for my taxes. GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]. Since the machine is there anyway, I put BOINC on it and started running ClimatePrediction, WCG, Rosetta, and Universe on it. It has accomplished only one CPDN task so far. Name: wah2_sam50_a0dd_201312_25_880_012033489_1; Workunit: 12033489; Created: 26 Dec 2020, 7:27:13 UTC; Sent: 26 Dec 2020, 7:27:20 UTC; Report deadline: 8 Dec 2021, 12:47:20 UTC; Received: 1 Jan 2021, 7:35:21 UTC; Server state: Over; Outcome: Success; Client state: Done; Exit status: 0 (0x00000000); Computer ID: 1512658; Run time: 5 days 23 hours 0 min 51 sec; CPU time: 5 days 21 hours 48 min 23 sec; Validate state: Valid; Credit: 19,018.14; Device peak FLOPS: 4.00 GFLOPS; Application version: Weather At Home 2 (wah2) v8.24 windows_intelx86; Peak working set size: 222.23 MB; Peak swap size: 188.70 MB; Peak disk usage: 75.02 MB |
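
The core-count argument made earlier in the thread (a 24-core machine, a 20-core cap on the multicore project, 8 CPDN tasks arriving) can be checked with a little arithmetic. The Python sketch below is only an illustration under simplifying assumptions that are not from the thread: each CPDN task uses exactly one core, and the multicore task always takes its full cap. The function name `displaced_cpdn_tasks` is hypothetical, and this does not model how the BOINC scheduler actually decides what to run.

```python
def displaced_cpdn_tasks(total_cores: int, multicore_cap: int, cpdn_tasks: int) -> int:
    """Count CPDN tasks that cannot keep running while a capped multicore task runs.

    Assumptions (illustrative only): one core per CPDN task, and the
    multicore task occupies exactly `multicore_cap` cores.
    """
    # Cores left over once the multicore task has taken its share.
    free_cores = max(total_cores - multicore_cap, 0)
    # Any CPDN task beyond the free cores would have to be paused.
    return max(cpdn_tasks - free_cores, 0)


# The case from the thread: 24 cores, 20-core cap, 8 CPDN tasks.
print(displaced_cpdn_tasks(24, 20, 8))
# With only 4 CPDN tasks, the 4 spare cores are enough.
print(displaced_cpdn_tasks(24, 20, 4))
```

Under these assumptions the first call gives 4, matching the "4 will still get shoved out of the way" estimate above. A fixed cap only protects as many CPDN tasks as there are spare cores, which is why it was described as not flexible.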
©2024 cpdn.org