climateprediction.net (CPDN) home page
Thread 'Site problems'

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64606 - Posted: 10 Oct 2021, 0:00:15 UTC - in response to Message 64604.  

As a side issue, if you’re running 17 CPDN WUs at a time, 63 WUs reserve is over a month’s worth. Any particular reason for holding that many?


My processor is set up to run at most four CPDN work units at a time. It can also run WCG, Rosetta, and Universe work units. My preferences are to store up to 1.5 days of additional work. In practice, when a CPDN work unit gets down to about a day to go, my client gets an additional work unit. Once I saw it get two additional work units because two of those running were almost complete. It normally takes me about eight days to complete an N216 work unit. My processor has 16 cores: 8 real and 8 hyperthreaded. I allow the client to use up to 8 cores for work units.

Right now, I have two N216 work units running. I have none in reserve. (This is just an observation, not a complaint.)
ID: 64606
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64608 - Posted: 10 Oct 2021, 12:07:55 UTC - in response to Message 64604.  
Last modified: 10 Oct 2021, 12:11:26 UTC

As a side issue, if you’re running 17 CPDN WUs at a time, 63 WUs reserve is over a month’s worth. Any particular reason for holding that many?
I sit on none that are Ready to Start. I get the best results running one or two per computer. The 63 were all Waiting to Run and several are now running. I doubt it'll take a month to finish.
It's a mystery to me how they ever finish a project. It seems like they'd take a couple of years at best with many holes in it. Maybe there's method to their madness but I suspect it's just tea and crumpets.
ID: 64608
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64609 - Posted: 10 Oct 2021, 20:16:38 UTC

It's a mystery to me how they ever finish a project.
They know how many of a batch will come back in a reasonable time and send out a number of work units that will bring back that many results. Sometimes if a batch is pushing the physics to the limits and consequently gets a higher failure rate than allowed for they will send out some extras.
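The batch sizing described above can be sketched as simple arithmetic. This is only an illustration: the target count and return rates are made-up numbers, not CPDN figures, and the function name `units_to_send` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def units_to_send(results_needed: int, return_percent: int) -> int:
    """Work units to issue so that roughly `results_needed` come back.

    `return_percent` is the share of issued work units expected to return
    a usable result in a reasonable time (hypothetical input, 1 to 100).
    """
    if not 0 < return_percent <= 100:
        raise ValueError("return_percent must be between 1 and 100")
    # Exact rational arithmetic avoids floating-point rounding surprises.
    return ceil(Fraction(results_needed * 100, return_percent))


# Hypothetical batch: 1000 results wanted, 80% expected to come back.
initial = units_to_send(1000, 80)

# Suppose the batch pushes the physics harder and only 70% actually return.
returned = initial * 70 // 100
shortfall = 1000 - returned

# Extras to issue so the shortfall is covered at the observed return rate.
extras = units_to_send(shortfall, 70)

print(initial, returned, shortfall, extras)
```

With a lower-than-planned return rate, the shortfall is covered by a second, smaller issue of extras rather than by reissuing every failed task.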
ID: 64609
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64610 - Posted: 11 Oct 2021, 12:05:55 UTC - in response to Message 64609.  

It's a mystery to me how they ever finish a project.
They know how many of a batch will come back in a reasonable time and send out a number of work units that will bring back that many results. Sometimes if a batch is pushing the physics to the limits and consequently gets a higher failure rate than allowed for they will send out some extras.
Extremely inefficient and wasteful. They really should learn how to use BOINC. They could greatly increase their throughput with a modest effort.
ID: 64610
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64611 - Posted: 11 Oct 2021, 12:50:25 UTC - in response to Message 64610.  

They DO know how to use BOINC. But it's just a sideline to their uni work.
And now that Oxford is back in term, the researchers who have tasks running will look at the results that they've gotten back.

Everything is fine.
Do not panic.
Do not adjust your minds.
ID: 64611
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64612 - Posted: 11 Oct 2021, 13:12:19 UTC

No, clearly they don't. If they know what they're doing, then why do we have to downgrade 3 libraries in order to run CP WUs???
They should recompile their code to include current libraries that are maintained in the Linux repositories. They should also fix the numerous segmentation violations.
But since they don't even care enough about this project to even read these forums I doubt anything will ever improve.
ID: 64612
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64613 - Posted: 11 Oct 2021, 13:41:48 UTC - in response to Message 64612.  

No clearly they don't. If they know what they're doing then why do we have to downgrade 3 libraries in order to run CP WUs???


I have never had to downgrade any libraries to run CPDN work. I have installed additional 32-bit libraries, but that is down to the code belonging to the Met Office and the licence for that code not allowing Oxford to modify it. There are other projects that require the 32-bit libraries as well, though I forget which at the moment.
ID: 64613
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64614 - Posted: 11 Oct 2021, 13:52:38 UTC

but that is down to the code belonging to the Met Office and the licence for that code not allowing Oxford to modify it.


I don't see many people in the queue to rewrite around a million lines of Fortran code either.
ID: 64614
Bryn Mawr
Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 64615 - Posted: 12 Oct 2021, 16:27:17 UTC

Finally the rubbish error messages at the top of each forum page have gone :-)

Hopefully this means that the server certificates are now correct. Thank you to whoever has corrected it.
ID: 64615
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64616 - Posted: 12 Oct 2021, 18:40:18 UTC - in response to Message 64615.  

Finally the rubbish error messages at the top of each forum page have gone :-)

Hopefully this means that the server certificates are now correct. Thank you to whoever has corrected it.


My system updated Firefox yesterday, and some system stuff today. I am not sure when this change took place:

-r--r--r--. 1 root root 243169 Sep 30 11:13 /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt

But one of those changed things for the better.
ID: 64616
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64617 - Posted: 12 Oct 2021, 21:30:30 UTC - in response to Message 64612.  

They should also fix the numerous segmentation violations.

I get none of them. Error tasks from 31 Dec 2020 to 7 Sep 2021. This machine started running Boinc 19 Nov 2020.
Bad Buffin           2
Negative Theta       4
WU Download Err      3
Error Code 25        3
Replenca             1
Negative Pressure    1
INITTIME             2
Setpos seek fail     1
Invalid Theta        1
Segv                 3

errno 25 in Linux is

#define ENOTTY 25 /* Not a typewriter */
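The errno lookup above can be checked from Python's standard library. This is a small sketch that assumes, as the post does, that the model's "Error Code 25" is an operating-system errno value; the human-readable message varies between C libraries.

```python
import errno
import os

code = 25

# errno.errorcode maps numeric errno values to their symbolic names.
name = errno.errorcode.get(code, "unknown")

# os.strerror returns the platform's own description for the code.
message = os.strerror(code)

print(f"errno {code}: {name} ({message})")
```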
ID: 64617
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64618 - Posted: 12 Oct 2021, 21:33:08 UTC - in response to Message 64617.  

P.S. those Segv errors were when the system was trying to do a stack trace after the process had already failed.
ID: 64618
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 64619 - Posted: 13 Oct 2021, 7:00:29 UTC

I get none of them. Error tasks from 31 Dec 2020 to 7 Sep 2021. This machine started running Boinc 19 Nov 2020.


Interestingly, I got one two days ago, and the task failed either at exactly the same point or certainly close to it on its previous attempt.
ID: 64619
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,033,074
RAC: 14,759
Message 64624 - Posted: 14 Oct 2021, 22:34:07 UTC - in response to Message 64587.  

Looks like the certificate problem on Linux has been sorted for the time being.

Tue 12 Oct 2021 05:26:23 BST | | [http] [ID#0] Info: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
Tue 12 Oct 2021 05:26:23 BST | | [http] [ID#0] Info: ALPN, server accepted to use h2
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: ALPN, server did not agree to a protocol
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: Server certificate:
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: subject: CN=www.cpdn.org
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: start date: Aug 15 23:07:04 2021 GMT
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: expire date: Nov 13 23:07:02 2021 GMT
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: subjectAltName: host "www.cpdn.org" matched cert's "www.cpdn.org"
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: issuer: C=US; O=Let's Encrypt; CN=R3
Thu 14 Oct 2021 23:07:34 BST | climateprediction.net | [http] [ID#1] Info: SSL certificate verify ok.
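The "sorted for the time being" remark can be made concrete from the start and expire dates in the log. A minimal sketch, assuming the date format shown in those two log lines and treating BST as UTC+1; the helper name `parse_cert_time` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Date layout used by the "start date" / "expire date" log lines.
CERT_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_time(text: str) -> datetime:
    """Parse a certificate timestamp from the log into an aware UTC datetime."""
    return datetime.strptime(text, CERT_FORMAT).replace(tzinfo=timezone.utc)


start = parse_cert_time("Aug 15 23:07:04 2021 GMT")
expire = parse_cert_time("Nov 13 23:07:02 2021 GMT")

# Log entry time "Thu 14 Oct 2021 23:07:34 BST"; BST is UTC+1.
bst = timezone(timedelta(hours=1), "BST")
logged = datetime(2021, 10, 14, 23, 7, 34, tzinfo=bst)

lifetime = expire - start
remaining = expire - logged

print(f"certificate lifetime:   {lifetime}")
print(f"time left at log entry: {remaining}")
```

The certificate in this log is valid for about 90 days, so the same check is worth repeating as the expiry date approaches.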
ID: 64624
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64625 - Posted: 14 Oct 2021, 23:57:18 UTC - in response to Message 64624.  

Posted: 12 Oct 2021, 18:40:18 UTC


I think the problem of all those complaints on the top of the web site pages has been fixed on my
Red Hat Enterprise Linux release 8.4 (Ootpa)
system, starting no later than the date and time above.

I am not getting any work units, but that is just a fact, not a complaint or problem.
ID: 64625
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64653 - Posted: 19 Oct 2021, 20:10:43 UTC

Uploads failing.
ID: 64653
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64654 - Posted: 19 Oct 2021, 20:16:45 UTC - in response to Message 64653.  

Check the batch number. If it's closed, then that's the reason.
ID: 64654
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,033,074
RAC: 14,759
Message 64656 - Posted: 20 Oct 2021, 11:49:19 UTC

Looks like we have a problem with uploads:

Wed 20 Oct 2021 12:33:13 BST | climateprediction.net | Temporarily failed upload of hadam4h_h0ye_201505_5_901_012076497_3_r1813679968_out.zip: transient HTTP error
ID: 64656
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,717,389
RAC: 8,111
Message 64657 - Posted: 20 Oct 2021, 12:20:31 UTC - in response to Message 64656.  

That looks like task 22141323, running under Linux - so it shouldn't be the certificate expiry problem (that normally affects Windows only).

You could try enabling http_debug logging temporarily, to see exactly what the nature of that 'transient HTTP error' is.
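For anyone unsure where that flag lives: in the BOINC client, log flags such as http_debug are normally set in cc_config.xml in the BOINC data directory. The sketch below only builds and validates the fragment; it does not write any file, since the data directory location differs between installs, and an existing cc_config.xml would need the flag merged in by hand rather than overwritten.

```python
import xml.etree.ElementTree as ET

# Minimal cc_config.xml content that turns on HTTP debug logging.
CONFIG_TEXT = """\
<cc_config>
  <log_flags>
    <http_debug>1</http_debug>
  </log_flags>
</cc_config>
"""

# Parse the text back to confirm it is well-formed XML before using it.
root = ET.fromstring(CONFIG_TEXT)
http_debug = root.findtext("log_flags/http_debug")

print(CONFIG_TEXT)
print("http_debug enabled:", http_debug == "1")
```

The client has to reread its config files (or be restarted) before the flag takes effect, and setting the value back to 0 afterwards keeps the event log readable.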
ID: 64657
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64658 - Posted: 20 Oct 2021, 12:43:37 UTC - in response to Message 64654.  
Last modified: 20 Oct 2021, 12:46:57 UTC

Check the batch number. If it's closed, then that's the reason.
Batches that will not upload: 852, 883, 886, and 895.
16533 10/20/2021 5:43:57 AM Project communication failed: attempting access to reference site
16534 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip: transient HTTP error
16535 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:20:06 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip
16536 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip: transient HTTP error
16537 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:14:14 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip
16538 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip: transient HTTP error
16539 climateprediction.net 10/20/2021 5:43:57 AM Backing off 03:02:20 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip
16540 10/20/2021 5:43:58 AM Internet access OK - project servers may be temporarily down.
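A log like the one above can be summarised with a short script. This sketch assumes the batch number is the fifth underscore-separated field of the upload file name, which matches batch 886 in these lines but is not a documented naming rule; the function names are hypothetical.

```python
import re
from datetime import timedelta

LOG = """\
16535 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:20:06 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip
16537 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:14:14 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip
16539 climateprediction.net 10/20/2021 5:43:57 AM Backing off 03:02:20 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip
"""

BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoffs(log: str) -> dict[str, timedelta]:
    """Map each upload file name to its back-off interval."""
    found = {}
    for match in BACKOFF.finditer(log):
        hours, minutes, seconds, filename = match.groups()
        found[filename] = timedelta(
            hours=int(hours), minutes=int(minutes), seconds=int(seconds)
        )
    return found


def batch_of(filename: str) -> str:
    """Return the assumed batch field (fifth underscore-separated part)."""
    return filename.split("_")[4]


backoffs = parse_backoffs(LOG)
for filename, delay in backoffs.items():
    print(f"batch {batch_of(filename)}: retry in {delay} for {filename}")
```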
ID: 64658