Thread 'New work Discussion'

Author	Message
KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63173 - Posted: 24 Dec 2020, 16:34:29 UTC - in response to Message 63172. Thank you. I thought it was the fault of both the machines. One Turbo-Boosts the clock speeds all by itself. Gaming laptop, I switched off these tendencies of it. I cannot switch off its Hyper-Threading because Acer has locked the BIOS. The other one switched off its HT, but it also started giving errors. I hope I complete one at least out of seventeen. So far, nine have gone to the graveyard. ID: 63173 ·

Bill F Send message Joined: 17 Jan 09 Posts: 124 Credit: 2,030,323 RAC: 2,771	Message 63174 - Posted: 25 Dec 2020, 1:19:53 UTC - in response to Message 63173. Windows and Linux have different requirements for running CPDN and different recommendations for a "better" chance of success on WUs. These are generalizations, independent of issues with the particular tasks being distributed at any given time. The message boards have several threads to look at; here are just a couple: Linux Libraries https://www.cpdn.org/forum_thread.php?id=7828#49056 Memory recommendations https://www.cpdn.org/forum_thread.php?id=8185#53062 BOINC Settings https://www.cpdn.org/forum_thread.php?id=7931#50571 Bill F ID: 63174 ·

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,992,465 RAC: 14,585	Message 63175 - Posted: 25 Dec 2020, 1:20:49 UTC - in response to Message 63167. I snagged 3 on their second attempt (without trying). One errored out after 6 zips but the other two are up to 16 and going strong (fingers and other digits crossed). BTW. Happy Xmas one and all! ID: 63175 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63176 - Posted: 25 Dec 2020, 1:48:36 UTC - in response to Message 63175. We haven't heard back from the project, but it's possible that this batch is running right near the edge of safe parameter space. ID: 63176 ·

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275	Message 63177 - Posted: 25 Dec 2020, 2:53:21 UTC - in response to Message 63176. Of the Windows regional models at 50 km resolution, the SAFR region batches have a relatively lower success rate than the other regions. I have no idea why, but as Les said, they may be running these experiments on that region with parameters that are closer to the edge of instability. I remember having quite a number of signal 11 failures with those earlier SAFR batches. ID: 63177 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63178 - Posted: 25 Dec 2020, 5:35:12 UTC - in response to Message 63177. Assuming it is a pushing-the-physics issue, it is unfortunate the error message isn't more informative. To put some figures on it, it is looking like twice as many have hard failed as succeeded so far. I haven't looked at the successes to see if any of them failed first time around with the signal 11. As the fails seem to be doing so after the 6th month, it may be too early for that anyway. Success: 30 (1%); Fails: 1804 (52%); Hard Fail: 67 (2%); Running: 3403 (97%); Unsent: 0 (0%). Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. ID: 63178 ·

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63179 - Posted: 25 Dec 2020, 7:39:46 UTC - in response to Message 63174. My Linux systems are behaving. It is my Windows systems that have trouble with these new WUs. I am keeping an eye on them now, and I noted that one crashed at 99%. They seem to be crashing right at the end. Anyway, I am getting more of these WUs, and all have been run at least once. They can keep coming. I am at peace. ID: 63179 ·

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63180 - Posted: 25 Dec 2020, 9:35:35 UTC If I may ask, what does "Signal 11" mean? ID: 63180 ·

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 63181 - Posted: 25 Dec 2020, 11:30:34 UTC - in response to Message 63180. Signal 11 is a segmentation fault, which is a memory error (Wikipedia). This might come from a computation that leads to an array index that isn't checked by the software itself before trying to access the indexed data (for understandable performance reasons). In other words, the temptation is to suppose a hardware memory error, but if lots of models are failing, as for this batch, then it looks more like a programming/parameter problem. However, there have also been error messages reported by BOINC applications in which an error number is generated by, say, FORTRAN but is then reported by the BOINC application as if the error number were from the BOINC error world (C, Linux etc.). In such a case the error number is valid but the text reported by the BOINC Manager is not. It's a long time since I systematically investigated this kind of thing, because, happily, my models almost never crash any more, but maybe some BOINC people or the project developers themselves might have a better answer. ID: 63181 ·
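
To make the "Signal 11" reading above concrete, here is a minimal Python sketch of the general POSIX convention, in which a process killed by a signal is reported with the negated signal number. It is only an illustration: the helper name `describe_exit` is hypothetical, and nothing here reflects how the BOINC client or the CPDN applications actually produce their error text.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a process return code the way Python's subprocess module
    reports it on POSIX systems (hypothetical helper, for illustration)."""
    if returncode < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # On Linux, signal 11 is SIGSEGV, the segmentation fault discussed above.
    print(describe_exit(-11))
    print(describe_exit(0))
```

The number alone says only that an invalid memory access happened. As noted above, it does not distinguish a hardware fault from a model that computed a bad array index.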

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 63182 - Posted: 26 Dec 2020, 2:03:44 UTC - in response to Message 63181. Thank you. ID: 63182 ·

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 63198 - Posted: 28 Dec 2020, 15:21:08 UTC ... having said what I did about "Signal 11" I am now a bit surprised that a reissue that previously failed with a "Signal 11" on someone else's machine has now completed on my machine. If it was a parameter error then that shouldn't happen. [Oops - engaged brain: the other machine was AMD and mine is Intel: butterfly flaps its wings in the Amazon etc. - so the model development would have been different on the two machines even if the parameters are the same.] ID: 63198 ·

JTM Send message Joined: 12 Oct 15 Posts: 2 Credit: 7,602,290 RAC: 0	Message 63282 - Posted: 6 Jan 2021, 12:17:21 UTC - in response to Message 63161. Good afternoon all, I'm new to Linux and have no 'deep' experience with BOINC - I'm just a client user, go easy on me! I'm in the same boat as some others here, having installed BOINC on a Linux VM running on one of my Windows PCs and receiving no tasks. I have checked that the 'no_alt_platform' parameter is zero, installed the 32-bit libraries and set up the VM with 4 cores (out of 32) and 16GB RAM (also from 32, I can shunt some around between machines as needed once up and running). Can I ask if there are any updates on this topic, or proposed solutions? I've checked parameters, removed and reattached the project already. Thanks! J. ID: 63282 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63283 - Posted: 6 Jan 2021, 13:29:30 UTC - in response to Message 63282. Could you post the lines from the event log under the tools menu from when you request work? This will let us see if it is probably the same issue as others have experienced. Also, this may be nothing to do with your issue, but do make sure that when requesting work manually via the update button you wait at least an hour after the last request; otherwise a setting on the server will send a message that the last request was too recent. ID: 63283 ·

JTM Send message Joined: 12 Oct 15 Posts: 2 Credit: 7,602,290 RAC: 0	Message 63284 - Posted: 6 Jan 2021, 14:22:15 UTC - in response to Message 63283. Thanks for the response, though my VM client is now chewing on two tasks! I restarted my VM less than an hour before the successful fetch, so I don't have the error log from a failed fetch. From memory, there was nothing obvious in the log for a failure, just 'got 0 new tasks' and 'project requested delay of 3636 seconds'. The last change I made was to reload the 32-bit libraries following a post elsewhere in the forum, so perhaps this was my issue? Seems to be solved now, thanks again. J. ID: 63284 ·
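
For anyone puzzling over the 'project requested delay of 3636 seconds' line, the arithmetic matches the "wait at least an hour" advice above: 3636 seconds is one hour and 36 seconds. A small Python sketch, with hypothetical helper names and an example timestamp that is not taken from any real log:

```python
from datetime import datetime, timedelta


def format_delay(seconds: int) -> str:
    """Render a delay given in seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


def next_request_time(last_request: datetime, delay_seconds: int) -> datetime:
    """Earliest time a new manual work request would fall outside the delay."""
    return last_request + timedelta(seconds=delay_seconds)


if __name__ == "__main__":
    print(format_delay(3636))
    # Example timestamp only, chosen for illustration.
    print(next_request_time(datetime(2021, 1, 6, 13, 0, 0), 3636))
```

Pressing the update button before that time has passed would just repeat the "too recent" response described above.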

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63285 - Posted: 6 Jan 2021, 15:40:41 UTC - in response to Message 63284. Glad you got it sorted. ID: 63285 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63302 - Posted: 9 Jan 2021, 19:50:04 UTC - in response to Message 63094. "I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520." My main machine is this one: CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux, Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]; BOINC version: 7.16.11; Memory: 62.45 GB; Cache: 16896 KB. It seems to me it is worth running. It runs at about 16 ms/timestep for hadam4h_h0d4_200711_5_889_012043959_0 (UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu). ID: 63302 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 63303 - Posted: 9 Jan 2021, 22:19:43 UTC - in response to Message 63302. It may depend more on what other projects you're trying to run at the same time. ID: 63303 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63304 - Posted: 9 Jan 2021, 22:41:07 UTC "I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520." Both are still faster than my laptop. ID: 63304 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 63305 - Posted: 10 Jan 2021, 12:32:19 UTC - in response to Message 63303. No doubt. My machine has 16 processors (8 hyperthreaded cores), 62.45 GB of memory and 16896 KB of cache, so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units. ID: 63305 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934	Message 63306 - Posted: 10 Jan 2021, 19:40:23 UTC "so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units." If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about. ID: 63306 ·