Message boards : Number crunching : Site problems
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"As a side issue, if you’re running 17 CPDN WUs at a time, 63 WUs reserve is over a month’s worth. Any particular reason for holding that many?" My processor is set up to run at most four CPDN work units at a time. It can also run WCG, Rosetta, and Universe work units. My preferences are to store up to 1.5 days of additional work. In practice, when a CPDN work unit gets down to about a day to go, my client gets an additional work unit. Once I saw it get two additional work units because two of those running were almost complete. It normally takes me about eight days to complete an N216 work unit. My processor has 16 cores: 8 real and 8 hyperthreaded. I allow the client to use up to 8 cores for work units. Right now, I have two N216 work units running. I have none in reserve. (This is just an observation, not a complaint.) |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
"As a side issue, if you’re running 17 CPDN WUs at a time, 63 WUs reserve is over a month’s worth. Any particular reason for holding that many?" I sit on none that are Ready to Start. I get the best results running one or two per computer. The 63 were all Waiting to Run and several are now running. I doubt it'll take a month to finish. It's a mystery to me how they ever finish a project. It seems like they'd take a couple of years at best with many holes in it. Maybe there's method to their madness but I suspect it's just tea and crumpets. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
"It's a mystery to me how they ever finish a project." They know how many of a batch will come back in a reasonable time and send out a number of work units that will bring back that many results. Sometimes, if a batch is pushing the physics to the limits and consequently gets a higher failure rate than allowed for, they will send out some extras. |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
"It's a mystery to me how they ever finish a project." "They know how many of a batch will come back in a reasonable time and send out a number of work units that will bring back that many results. Sometimes if a batch is pushing the physics to the limits and consequently gets a higher failure rate than allowed for they will send out some extras." Extremely inefficient and wasteful. They really should learn how to use BOINC. They could greatly increase their throughput with a modest effort. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
They DO know how to use BOINC. But it's just a sideline to their uni work. And now that Oxford is back in term, the researchers who have tasks running will look at the results that they've gotten back. Everything is fine. Do not panic. Do not adjust your minds. |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
No, clearly they don't. If they know what they're doing, then why do we have to downgrade 3 libraries in order to run CP WUs? They should recompile their code to include current libraries that are maintained in the Linux repositories. They should also fix the numerous segmentation violations. But since they don't care enough about this project to even read these forums, I doubt anything will ever improve. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
"No clearly they don't. If they know what they're doing then why do we have to downgrade 3 libraries in order to run CP WUs???" I have never had to downgrade any libraries to run CPDN work. I have installed additional 32-bit libraries, but that is down to the code belonging to the Met Office and the licence for that code not allowing Oxford to modify it. There are other projects that require the 32-bit libraries as well, though I forget which at the moment. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
"but that is down to the code belonging to the Met Office and the licence for that code not allowing Oxford to modify it." I don't see many people in the queue to rewrite around a million lines of Fortran code either. |
Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228 |
Finally the rubbish error messages at the top of each forum page have gone :-) Hopefully this means that the server certificates are now correct. Thank you to whoever has corrected it. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Finally the rubbish error messages at the top of each forum page have gone :-)" My system updated Firefox yesterday, and some system stuff today. I am not sure when this change took place: -r--r--r--. 1 root root 243169 Sep 30 11:13 /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt But one of those changed things for the better. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"They should also fix the numerous segmentation violations." I get none of them. Error tasks from 31 Dec 2020 to 7 Sep 2021 (this machine started running BOINC 19 Nov 2020): Bad Buffin: 2; Negative Theta: 4; WU Download Err: 3; Error Code 25: 3; Replenca: 1; Negative Pressure: 1; INITTIME: 2; Setpos seek fail: 1; Invalid Theta: 1; Segv: 3. errno 25 in Linux is #define ENOTTY 25 /* Not a typewriter */ |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
P.S. those Segv errors were when the system was trying to do a stack trace after the process had already failed. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
"I get none of them. Error tasks from 31 Dec 2020 to 7 Sep 2021. This machine started running BOINC 19 Nov 2020." Interestingly, I got one two days ago, and the task failed either at exactly the same point or certainly close to it on its previous attempt. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
Looks like the certificate problem on Linux has been sorted for the time being. Tue 12 Oct 2021 05:26:23 BST | | [http] [ID#0] Info: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 Tue 12 Oct 2021 05:26:23 BST | | [http] [ID#0] Info: ALPN, server accepted to use h2 Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: ALPN, server did not agree to a protocol Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: Server certificate: Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: subject: CN=www.cpdn.org Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: start date: Aug 15 23:07:04 2021 GMT Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: expire date: Nov 13 23:07:02 2021 GMT Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: subjectAltName: host "www.cpdn.org" matched cert's "www.cpdn.org" Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: issuer: C=US; O=Let's Encrypt; CN=R3 Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: SSL certificate verify ok. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Posted: 12 Oct 2021, 18:40:18 UTC I think the problem of all those complaints on the top of the web site pages has been fixed on my Red Hat Enterprise Linux release 8.4 (Ootpa) system, starting no later than the date and time above. I am not getting any work units, but that is just a fact, not a complaint or problem. |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
Uploads failing. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Check the batch number. If it's closed, then that's the reason. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,975,898 RAC: 14,500 |
Looks like we have a problem with uploads: Wed 20 Oct 2021 12:33:13 BST | climateprediction.net | Temporarily failed upload of hadam4h_h0ye_201505_5_901_012076497_3_r1813679968_out.zip: transient HTTP error |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
That looks like task 22141323, running under Linux - so it shouldn't be the certificate expiry problem (that normally affects Windows only). You could try enabling http_debug logging temporarily, to see exactly what the nature of that 'transient HTTP error' is. |
Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318 |
"Check the batch number. If it's closed, then that's the reason." Batches that will not upload: 852, 883, 886, and 895. 16533 10/20/2021 5:43:57 AM Project communication failed: attempting access to reference site 16534 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip: transient HTTP error 16535 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:20:06 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip 16536 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip: transient HTTP error 16537 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:14:14 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip 16538 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip: transient HTTP error 16539 climateprediction.net 10/20/2021 5:43:57 AM Backing off 03:02:20 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip 16540 10/20/2021 5:43:58 AM Internet access OK - project servers may be temporarily down. |
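
For anyone who wants to check which batches their stuck uploads belong to, here is a rough Python sketch that pulls the "Backing off" entries out of event log text like the excerpt above. It is only an illustration: the names `parse_backoffs` and `batch_of` are made up for this example, the regular expression assumes the line shape seen in that excerpt, and treating the fifth underscore-separated field of the file name as the batch number is an inference from this thread (886 appears there and is one of the batches listed), not something documented here.

```python
import re
from datetime import timedelta

# Three "Backing off" entries copied from the log excerpt in the post above.
SAMPLE_LOG = """\
16535 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:20:06 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip
16537 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:14:14 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip
16539 climateprediction.net 10/20/2021 5:43:57 AM Backing off 03:02:20 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip
"""

# ASSUMPTION: the client writes "Backing off HH:MM:SS on upload of <file>",
# as in the excerpt; other client versions may word this differently.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoffs(log_text):
    """Return {filename: timedelta} for every 'Backing off' entry found."""
    backoffs = {}
    for hours, minutes, seconds, filename in BACKOFF_RE.findall(log_text):
        backoffs[filename] = timedelta(
            hours=int(hours), minutes=int(minutes), seconds=int(seconds)
        )
    return backoffs


def batch_of(filename):
    """Guess the batch number from an upload file name.

    ASSUMPTION: the fifth underscore-separated field is the batch number.
    """
    return filename.split("_")[4]


# Print one summary line per stuck upload.
for name, delay in parse_backoffs(SAMPLE_LOG).items():
    print(f"batch {batch_of(name)}: {name} retries in {delay}")
```

Pasting your own event log text in place of `SAMPLE_LOG` should give one line per stuck upload, which makes it easier to compare against the list of closed batches.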
©2024 cpdn.org