Thread 'New work Discussion'

KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63173 - Posted: 24 Dec 2020, 16:34:29 UTC - in response to Message 63172.  

Thank you. I thought it was the fault of both the machines. One Turbo-Boosts the clock speeds all by itself. Gaming laptop, I switched off these tendencies of it. I cannot switch off its Hyper-Threading because Acer has locked the BIOS. The other one switched off its HT, but it also started giving errors. I hope I complete one at least out of seventeen. So far, nine have gone to the graveyard.

Bill F
Joined: 17 Jan 09
Posts: 124
Credit: 2,030,323
RAC: 2,771
Message 63174 - Posted: 25 Dec 2020, 1:19:53 UTC - in response to Message 63173.  

Windows and Linux have different requirements for running CPDN, and different recommendations for a better chance of success with work units (WUs). These are generalizations and do not account for issues specific to whichever tasks are being distributed at any given time.

The message boards have several threads to look at; here are just a couple:

Linux Libraries https://www.cpdn.org/forum_thread.php?id=7828#49056

Memory recommendations https://www.cpdn.org/forum_thread.php?id=8185#53062

BOINC Settings https://www.cpdn.org/forum_thread.php?id=7931#50571

Bill F

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,991,636
RAC: 14,563
Message 63175 - Posted: 25 Dec 2020, 1:20:49 UTC - in response to Message 63167.  

I snagged 3 on their second attempt (without trying). One errored out after 6 zips but the other two are up to 16 and going strong (fingers and other digits crossed).

BTW. Happy Xmas one and all!

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63176 - Posted: 25 Dec 2020, 1:48:36 UTC - in response to Message 63175.  

We haven't heard back from the project, but it's possible that this batch is running right near the edge of safe parameter space.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 63177 - Posted: 25 Dec 2020, 2:53:21 UTC - in response to Message 63176.  

Of the Windows regional models at 50 km resolution, the SAFR region batches have a relatively lower success rate than the other regions. I have no idea why, but as Les said, they may be running these experiments on that region with parameters that are closer to the edge of instability. I remember having quite a number of signal 11 failures with those earlier SAFR batches.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63178 - Posted: 25 Dec 2020, 5:35:12 UTC - in response to Message 63177.  

"I remember having quite a number of signal 11 failures with those earlier SAFR batches."


Assuming it is a pushing-the-physics issue, it is unfortunate that the error message isn't more informative.

To put some figures on it, it looks like twice as many have hard failed as succeeded so far. I haven't looked at the successes to see whether any of them failed first time around with the signal 11. As the failures seem to occur after the 6th month, it may be too early for that anyway.

Success: 30 (1%)
Fails: 1804 (52%)
Hard Fail: 67 (2%)
Running: 3403 (97%)
Unsent: 0 (0%)
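
To make the "twice as many" comparison concrete, here is a small Python sketch that uses only the hard-fail and success counts quoted above. The function and variable names are illustrative and do not come from any CPDN tool.

```python
# Counts copied from the batch statistics quoted above.
successes = 30
hard_fails = 67


def failure_ratio(hard_fails: int, successes: int) -> float:
    """Return the number of hard failures per successful task."""
    if successes == 0:
        raise ValueError("no successes yet, so the ratio is undefined")
    return hard_fails / successes


if __name__ == "__main__":
    ratio = failure_ratio(hard_fails, successes)
    print(f"{ratio:.2f} hard fails per success")
```

With these counts the ratio comes out a little above two, which matches the rough estimate.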
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.

KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63179 - Posted: 25 Dec 2020, 7:39:46 UTC - in response to Message 63174.  

My Linux systems are behaving. It is my Windows systems that are struggling with these new WUs. I am keeping an eye on them now, and I noted that one crashed at 99%. They seem to be crashing right at the end. Anyway, I am getting more of these WUs and all have been run at least once. They can keep coming. I am at peace.

KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63180 - Posted: 25 Dec 2020, 9:35:35 UTC

If I may ask, what does "Signal 11" mean?

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 63181 - Posted: 25 Dec 2020, 11:30:34 UTC - in response to Message 63180.  

Signal 11 is SIGSEGV, a segmentation fault, which is a memory access error (Wikipedia).
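
For anyone who wants to check the number-to-name mapping on their own machine, here is a minimal Python sketch. The helper name `describe_signal` is made up for illustration; on Linux, 11 should map to SIGSEGV.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    print(describe_signal(11))
```

This only translates the number. It says nothing about why a particular model task received the signal.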

This might come from a computation that leads to an array index that isn't checked by the software itself before trying to access the indexed data (for understandable performance reasons). In other words, the temptation is to suppose a hardware memory error but if lots of models are failing, as for this batch, then it looks more like a programming/parameter problem.
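
As a rough illustration of the unchecked-index scenario, here is a Python sketch in which an index is computed from a model value. The table, its values and the function names are invented for this example and are not taken from the model code. Python checks list bounds itself (and lets negative indices wrap around), so the explicit check below stands in for what compiled code without bounds checking would skip.

```python
# Hypothetical lookup table: one entry per 10 K bin starting at 250 K.
# The numbers are placeholders, not physical data.
LOOKUP_TABLE = [0.6, 1.2, 2.3, 4.2, 7.4]
TABLE_START_K = 250.0
BIN_WIDTH_K = 10.0


def table_index(temperature_k: float) -> int:
    """Compute the table bin for a temperature, with no range check."""
    return int((temperature_k - TABLE_START_K) // BIN_WIDTH_K)


def lookup(temperature_k: float) -> float:
    """Look up a value, rejecting indices that fall outside the table."""
    index = table_index(temperature_k)
    if not 0 <= index < len(LOOKUP_TABLE):
        # Without this check, compiled code would read whatever memory
        # happens to sit at that offset, or crash with a segmentation fault.
        raise IndexError(f"index {index} is outside the table")
    return LOOKUP_TABLE[index]


if __name__ == "__main__":
    print(lookup(265.0))       # a sensible temperature stays inside the table
    try:
        lookup(900.0)          # a run that has "blown up" lands far outside it
    except IndexError as err:
        print("rejected:", err)
```

The point is that the same code can be fine for sensible inputs and fail only once the simulation drifts into unphysical values.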

However, there have also been error messages reported by BOINC applications in which an error number is generated by, say, FORTRAN but is then reported by the BOINC application as if the error number was from the BOINC error world (C, Linux etc.). In such a case the error number is valid but the text reported by the BOINC Manager is not. It's a long time since I systematically investigated this kind of thing - because, happily, my models almost never crash any more - but maybe some BOINC people might have a better answer or the project developers themselves.

KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63182 - Posted: 26 Dec 2020, 2:03:44 UTC - in response to Message 63181.  

Thank you.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 63198 - Posted: 28 Dec 2020, 15:21:08 UTC

... having said what I did about "Signal 11" I am now a bit surprised that a reissue that previously failed with a "Signal 11" on someone else's machine has now completed on my machine.

If it was a parameter error then that shouldn't happen.

[Oops - engaged brain: the other machine was AMD and mine is Intel: butterfly flaps its wings in the Amazon etc. - so the model development would have been different on the two machines even if the parameters are the same.]
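
The butterfly effect mentioned here can be shown with a toy chaotic system. The sketch below uses the logistic map, which has nothing to do with the climate model itself; it is only an assumption-laden stand-in to show how a difference at the level of floating-point rounding grows until two runs no longer resemble each other.

```python
from typing import List


def logistic_trajectory(x0: float, steps: int, r: float = 3.9) -> List[float]:
    """Iterate the logistic map x -> r * x * (1 - x) from x0."""
    values = [x0]
    for _ in range(steps):
        values.append(r * values[-1] * (1.0 - values[-1]))
    return values


def divergence(x0: float, perturbation: float, steps: int) -> List[float]:
    """Absolute gap at each step between two runs with slightly different starts."""
    run_a = logistic_trajectory(x0, steps)
    run_b = logistic_trajectory(x0 + perturbation, steps)
    return [abs(a - b) for a, b in zip(run_a, run_b)]


if __name__ == "__main__":
    gaps = divergence(0.2, 1e-15, 300)
    for step in (0, 25, 50, 100, 200):
        print(step, gaps[step])
```

The two runs start a rounding error apart and end up completely decorrelated, which is the same reason a task can crash on one CPU and complete on another with identical parameters.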

JTM
Joined: 12 Oct 15
Posts: 2
Credit: 7,602,290
RAC: 0
Message 63282 - Posted: 6 Jan 2021, 12:17:21 UTC - in response to Message 63161.  

Good afternoon all,

I'm new to Linux and have no 'deep' experience with BOINC - I'm just a client user, go easy on me!

I'm in the same boat as some others here, having installed BOINC on a Linux VM running on one of my Windows PCs and receiving no tasks. I have checked that the 'no_alt_platform' parameter is zero, installed the 32-bit libraries, and set up the VM with 4 cores (out of 32) and 16 GB RAM (also out of 32; I can shunt some around between machines as needed once up and running).
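
A quick sanity check for 32-bit support, sketched in Python under the assumption of a typical x86 Linux install, where 32-bit glibc programs are started through the loader at `/lib/ld-linux.so.2`. The function name is made up for illustration, and finding the loader does not prove that every library the model needs is installed.

```python
import os

# Usual path of the 32-bit (i386) dynamic loader on x86 Linux systems.
I386_LOADER = "/lib/ld-linux.so.2"


def has_32bit_loader(path: str = I386_LOADER) -> bool:
    """Return True if the 32-bit dynamic loader is present at the given path."""
    return os.path.exists(path)


if __name__ == "__main__":
    if has_32bit_loader():
        print("32-bit loader found")
    else:
        print("32-bit loader missing: 32-bit applications cannot start")
```

If the loader is missing, the Linux Libraries thread linked earlier in this discussion is the place to look for the packages to install.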

Can I ask if there are any updates on this topic, or proposed solutions? I've checked parameters, removed and reattached the project already.

Thanks!

J.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63283 - Posted: 6 Jan 2021, 13:29:30 UTC - in response to Message 63282.  

Could you post the lines from the event log under the tools menu from when you request work? This will let us see if it is probably the same issue as others have experienced.

Also, this may have nothing to do with your issue, but do make sure that when requesting work manually via the Update button you wait at least an hour after the last request; otherwise a setting on the server will send a message that the last request was too recent.

JTM
Joined: 12 Oct 15
Posts: 2
Credit: 7,602,290
RAC: 0
Message 63284 - Posted: 6 Jan 2021, 14:22:15 UTC - in response to Message 63283.  

Thanks for the response, though my VM client is now chewing on two tasks! I restarted my VM less than an hour before the successful fetch, so I don't have the error log from a failed fetch. From memory, there was nothing obvious in the log for a failure, just 'got 0 new tasks' and 'project requested delay of 3636 seconds'. The last change I made was to reload the 32-bit libraries following a post elsewhere in the forum, so perhaps this was my issue?
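
The requested delay of 3636 seconds is just over an hour. A small Python sketch of the arithmetic follows; the timestamp is a made-up example and the function name is illustrative, not part of BOINC.

```python
from datetime import datetime, timedelta, timezone


def next_allowed_request(last_request: datetime, delay_seconds: int) -> datetime:
    """Return the earliest time a new work request avoids the server delay."""
    return last_request + timedelta(seconds=delay_seconds)


if __name__ == "__main__":
    # Hypothetical time of the last scheduler request.
    last = datetime(2021, 1, 6, 12, 0, 0, tzinfo=timezone.utc)
    print("delay:", timedelta(seconds=3636))
    print("next request after:", next_allowed_request(last, 3636))
```

Pressing Update before that time would most likely just produce another "too recent" style response rather than new tasks.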

Seems to be solved now, thanks again.

J.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63285 - Posted: 6 Jan 2021, 15:40:41 UTC - in response to Message 63284.  

Glad you got it sorted.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63302 - Posted: 9 Jan 2021, 19:50:04 UTC - in response to Message 63094.  

"I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520."


My main machine is this one:

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	62.45 GB
Cache 	16896 KB


It seems to me it is worth running. It runs at about 16 ms/timestep for task hadam4h_h0d4_200711_5_889_012043959_0 (UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu).

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63303 - Posted: 9 Jan 2021, 22:19:43 UTC - in response to Message 63302.  

It may depend more on what other projects you're trying to run at the same time.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63304 - Posted: 9 Jan 2021, 22:41:07 UTC

"I'm trying to figure out if it is even worthwhile to keep running these on my struggling i7-920 and Xeon w3520."


Both are still faster than my laptop.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 63305 - Posted: 10 Jan 2021, 12:32:19 UTC - in response to Message 63303.  

No doubt. My machine has
Number of processors 16 (8 hyperthreaded cores)
Memory 62.45 GB
Cache 16896 KB
so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,019,755
RAC: 20,934
Message 63306 - Posted: 10 Jan 2021, 19:40:23 UTC

"so I run three UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu tasks at a time. At the moment, the others are one rosetta@home and four WCG work units."


If the WCG tasks are Africa Rain Project ones, they, like the N216 tasks, use a lot of cache memory. There may well be other ones, similarly heavy on resources, that I don't know about.
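
For a very rough feel of the cache pressure, here is a Python sketch that splits the 16896 KB cache reported earlier in this thread evenly across concurrently running tasks. The even split is a simplifying assumption: real CPU caches are not divided this way, and different applications have very different working sets. The function name is illustrative.

```python
def cache_per_task_kb(total_cache_kb: int, concurrent_tasks: int) -> float:
    """Naive even share of cache per running task, in KB."""
    if concurrent_tasks <= 0:
        raise ValueError("need at least one running task")
    return total_cache_kb / concurrent_tasks


if __name__ == "__main__":
    total_kb = 16896                 # cache size quoted for the W-2245 machine
    n216_only = 3                    # three HadAM4 N216 tasks
    all_tasks = 3 + 1 + 4            # plus one rosetta@home and four WCG tasks
    print(cache_per_task_kb(total_kb, n216_only), "KB each with N216 tasks alone")
    print(cache_per_task_kb(total_kb, all_tasks), "KB each with everything running")
```

Even as a crude estimate, it shows how quickly the share per task shrinks when several cache-hungry projects run side by side.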