Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Thank you. I thought it was the fault of both the machines. One Turbo-Boosts the clock speeds all by itself. Gaming laptop, I switched off these tendencies of it. I cannot switch off its Hyper-Threading because Acer has locked the BIOS. The other one switched off its HT, but it also started giving errors. I hope I complete one at least out of seventeen. So far, nine have gone to the graveyard. |
Joined: 17 Jan 09 Posts: 124 Credit: 2,030,323 RAC: 2,771 |
Thank you. I thought it was the fault of both the machines. One Turbo-Boosts the clock speeds all by itself. Gaming laptop, I switched off these tendencies of it. I cannot switch off its Hyper-Threading because Acer has locked the BIOS. The other one switched off its HT, but it also started giving errors. I hope I complete one at least out of seventeen. So far, nine have gone to the graveyard. -------------------------------- Windows and Linux have different requirements for running CPDN and different recommendations for a "better" chance of success on WUs. These are generalizations and do not account for issues specific to the different tasks being distributed at any given time. The message boards have several threads to look at; here are just a couple. Linux libraries: https://www.cpdn.org/forum_thread.php?id=7828#49056 Memory recommendations: https://www.cpdn.org/forum_thread.php?id=8185#53062 BOINC settings: https://www.cpdn.org/forum_thread.php?id=7931#50571 Bill F |
Joined: 22 Feb 06 Posts: 491 Credit: 30,992,465 RAC: 14,585 |
I snagged 3 on their second attempt (without trying). One errored out after 6 zips but the other two are up to 16 and going strong (fingers and other digits crossed). BTW. Happy Xmas one and all! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
We haven't heard back from the project, but it's possible that this batch is running right near the edge of safe parameter space. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Of the Windows regional models at 50 km resolution, the SAFR region batches have a relatively lower success rate than the other regions. I have no idea why, but as Les said, they may be running these experiments on that region with parameters that are closer to the edge of instability. I remember having quite a number of signal 11 failures with those earlier SAFR batches. |
Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
I remember having quite a number of signal 11 failures with those earlier SAFR batches. -------------------------------- Assuming it is a pushing-the-physics issue, it is unfortunate the error message isn't more informative. To put some figures on it, it looks like twice as many have hard failed as succeeded so far. I haven't looked at the successes to see if any of them failed first time around with the signal 11. As the fails seem to be doing so after the 6th month, it may be too early for that anyway. Success: 30 (1%); Fails: 1804 (52%); Hard Fail: 67 (2%); Running: 3403 (97%); Unsent: 0 (0%). Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer. |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Thank you. I thought it was the fault of both the machines. One Turbo-Boosts the clock speeds all by itself. Gaming laptop, I switched off these tendencies of it. I cannot switch off its Hyper-Threading because Acer has locked the BIOS. The other one switched off its HT, but it also started giving errors. I hope I complete one at least out of seventeen. So far, nine have gone to the graveyard. -------------------------------- My Linux systems are behaving. It is my Windows systems that have trouble with these new WUs. I am keeping an eye on them now and I noted that one crashed at 99%. They seem to be crashing right at the end. Anyway, I am getting more of these WUs and all have been run at least once. They can keep coming. I am at peace. |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
If I may ask, what does "Signal 11" mean? |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
If I may ask, what does "Signal 11" mean? -------------------------------- Signal 11 is a segmentation fault, which is a memory error (Wikipedia). This might come from a computation that leads to an array index that isn't checked by the software itself before trying to access the indexed data (for understandable performance reasons). In other words, the temptation is to suppose a hardware memory error, but if lots of models are failing, as for this batch, then it looks more like a programming/parameter problem. However, there have also been error messages reported by BOINC applications in which an error number is generated by, say, FORTRAN but is then reported by the BOINC application as if the error number was from the BOINC error world (C, Linux etc.). In such a case the error number is valid but the text reported by the BOINC Manager is not. It's a long time since I systematically investigated this kind of thing, because, happily, my models almost never crash any more, but maybe some BOINC people or the project developers themselves might have a better answer. |
Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0 |
Thank you. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
... having said what I did about "Signal 11" I am now a bit surprised that a reissue that previously failed with a "Signal 11" on someone else's machine has now completed on my machine. If it was a parameter error then that shouldn't happen. [Oops - engaged brain: the other machine was AMD and mine is Intel: butterfly flaps its wings in the Amazon etc. - so the model development would have been different on the two machines even if the parameters are the same.] |
Joined: 12 Oct 15 Posts: 2 Credit: 7,602,290 RAC: 0 |
Good afternoon all, I'm new to Linux and have no 'deep' experience with BOINC. I'm just a client user, go easy on me! I'm in the same boat as some others here, having installed BOINC on a Linux VM running on one of my Windows PCs and receiving no tasks. I have checked that the 'no_alt_platform' parameter is zero, installed the 32-bit libraries and set up the VM with 4 cores (out of 32) and 16GB RAM (also from 32; I can shunt some around between machines as needed once up and running). Can I ask if there are any updates on this topic, or proposed solutions? I've checked parameters, and removed and reattached the project already. Thanks! J. |
Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
Could you post the lines from the event log under the tools menu from when you request work? This will let us see if it is probably the same issue as others have experienced. Also, this may be nothing to do with your issue, but do make sure that when requesting work manually via the update button you wait at least an hour after the last request; otherwise a setting on the server will send a message that the last request was too recent. |
Joined: 12 Oct 15 Posts: 2 Credit: 7,602,290 RAC: 0 |
Could you post the lines from the event log under the tools menu from when you request work? This will let us see if it is probably the same issue as others have experienced. -------------------------------- Thanks for the response, though my VM client is now chewing on two tasks! I restarted my VM less than an hour before the successful fetch, so I don't have the error log from a failed fetch. From memory, there was nothing obvious in the log for a failure, just 'got 0 new tasks' and 'project requested delay of 3636 seconds'. The last change I made was to reload the 32-bit libraries following a post elsewhere in the forum, so perhaps this was my issue? Seems to be solved now, thanks again. J. |
Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
Glad you got it sorted. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520. My main machine is this one: CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]; BOINC version: 7.16.11; Memory: 62.45 GB; Cache: 16896 KB. It seems to me it is worth running. It runs about 16 ms/timestep for hadam4h_h0d4_200711_5_889_012043959_0 (UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu). |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It may depend more on what other projects you're trying to run at the same time. |
Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520. -------------------------------- Both are still faster than my laptop. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
It may depend more on what other projects you're trying to run at the same time. -------------------------------- No doubt. My machine has 16 processors (8 hyperthreaded cores), 62.45 GB of memory and 16896 KB of cache, so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units. |
Joined: 15 May 09 Posts: 4540 Credit: 19,019,755 RAC: 20,934 |
so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units. -------------------------------- If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones that are similarly high on resource use that I don't know about. |
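
Two details from the thread can be checked with a few lines of Python: what "Signal 11" maps to on a given machine, and how long the server asked the client to wait in a log line such as 'project requested delay of 3636 seconds'. This is only a rough sketch. The function names are made up for illustration, and the log wording is the one quoted from memory in the thread, so the real event log text may differ.

```python
import re
import signal
from typing import Optional


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this operating system.
        return f"unknown signal {number}"


# Assumption: this pattern follows the wording quoted from memory in the
# thread; the actual BOINC event log line may be phrased differently.
DELAY_PATTERN = re.compile(r"project requested delay of (\d+) seconds", re.IGNORECASE)


def requested_delay_minutes(log_line: str) -> Optional[float]:
    """Extract the requested delay from one log line, converted to minutes."""
    match = DELAY_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / 60


if __name__ == "__main__":
    print(signal_name(11))
    print(requested_delay_minutes("project requested delay of 3636 seconds"))
```

On Linux, signal 11 is SIGSEGV, the segmentation fault described earlier, and 3636 seconds is just over an hour, which fits the advice to wait at least an hour between manual update requests.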
©2024 cpdn.org