|
Info | Message |
---|---|
1) Message boards : Number crunching : Stats export
Message 71966 Posted 10 days ago by Richard Haselgrove |
https://main.cpdn.org/stats/ is viewable and contains files, but they're all dated 9 February - matching the BOINCstats report: so the problem is internal to CPDN. The CPDN internal server status page (https://main.cpdn.org/server_status.php) shows that all services except the Scheduler are "Not Running", so we'll have to wait for those to be restarted before there's anything to export. |
2) Message boards : Number crunching : Stats export
Message 71948 Posted 31 Jan 2025 by Richard Haselgrove |
The message got through, and has been acted upon. The first updated file has been collected by BOINCstats, and all should be processed over the course of their usual 24-hour cycle. Now to check the other two public aggregation sites. Edit - they're OK. |
3) Message boards : Number crunching : Stats export
Message 71946 Posted 29 Jan 2025 by Richard Haselgrove |
I had observed the same thing myself while processing my final resend - it appeared to start from 09 Jan 2025 in my case. I think I've seen other anomalies as well. I'll try and have a deeper dive over the next few days, if I don't get too many distractions. |
4) Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Message 71938 Posted 25 Jan 2025 by Richard Haselgrove |
Yes - I was posting from a Windows machine, and checked it from Linux later. I sometimes do that when doing housekeeping - work on the Linux machine, while referring to the instructions on a Windows machine and separate screen beside it. And I haven't found a way for making that work in the current state of the BOINC documentation. Linux needs to be listed on the 'download all' page, which is otherwise cross-platform. |
5) Message boards : Number crunching : Feedback on running OpenIFS large memory (16-25 Gb+) configurations requested
Message 71936 Posted 25 Jan 2025 by Richard Haselgrove |
Presumably it is fixed in 8.1.0 which can be installed following the instruction on the BOINC download page.
v8.1.0 (odd version number) is very much for 'work in progress', and is constantly changing. It can't be downloaded in a 'ready to run' form: it has to be compiled by the user from source code. Some users may be equipped to handle that process, but I don't think it can be recommended for the vast majority of our users. Instead, there's a version 8.0.4 available on the 'all versions' download page (https://boinc.berkeley.edu/download_all.php), though unfortunately not for Linux: and the instructions for building your own copy have gone AWOL from https://boinc.berkeley.edu/wiki/BuildSystem. I think we probably need to engage with BOINC about getting a usable version of BOINC, and the related documentation, available for the general Linux user. But just at the moment, the key people seem to be tying themselves in knots over an incompatibility between BOINC and VirtualBox on Apple machines. |
6) Questions and Answers : Windows : "Calculation failure" after whenever i reboot the PC
Message 71867 Posted 17 Nov 2024 by Richard Haselgrove |
And I've filleted out the CPDN messages from the 'old' log file, and there's nothing unexpected. My models run 24/7 without suspension or swap-out (except when the unexpected happens ...), so it's easy to check that the 'trickle' pattern is uniform with a flicker test. But I do tend to run the machines full-bore when work is available. There was a second CPDN task running when this one started, which may account for some extra overload: and the other cores were working on simpler, integer-only, tasks from another project throughout. My suspicion is that all the extra bells and whistles added to Windows 11 and constantly checking for outside news to tell me about, are the main cause of the lost time. |
7) Questions and Answers : Windows : "Calculation failure" after whenever i reboot the PC
Message 71863 Posted 16 Nov 2024 by Richard Haselgrove |
I got caught out by failing to follow my own advice! My Windows 11 laptop was running a batch 1024 resend with wah2 (region independent) v8.32 when Microsoft Update Tuesday struck: it was actually applied to this machine on 14 November (Thursday). The task (22528569) failed, so I looked into the log files. The machine rotated the stdoutdae files on this reboot, so I found:
In the 'old' file, the last few lines are:
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Deferring communication for 00:01:49
14-Nov-2024 23:32:05 [climateprediction.net] [sched_op] Reason: Unrecoverable error for task wah2_eas25_a17m_201212_24_1024_012323825_1
14-Nov-2024 23:32:10 [climateprediction.net] Computation for task wah2_eas25_a17m_201212_24_1024_012323825_1 finished
{list of absent upload files}
14-Nov-2024 23:32:10 [---] Exiting StartServiceCtrlDispatcher being called. This may take several seconds. Please wait.
Then, the new file starts with:
14-Nov-2024 23:32:54 [---] Starting BOINC client version 8.0.3 for windows_x86_64
{normal startup messages}
14-Nov-2024 23:32:58 [climateprediction.net] [sched_op] Starting scheduler request
14-Nov-2024 23:32:58 [climateprediction.net] Sending scheduler request: To report completed tasks.
and so on. So it appears that the damage is actually done during closedown - not, as I had assumed, at restart. Windows applies the updates and triggers the restart "outside working hours", so it will be assuming that the user isn't around to respond to the messages that are sometimes generated during a 'soft' close ('save edited file?', and so on). So, is this version of the app responding badly to 'hard' shutdowns? |
8) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024
Message 71861 Posted 12 Nov 2024 by Richard Haselgrove |
What was odd was I was still able to manually kick the remaining transfers to complete them all. The client didn't abort remaining task transfers after marking the task as 'download failed'?
It's probably down to a twenty-year-old unspoken assumption, from the early design days of BOINC. I would guess that, with slower download speeds available and smaller local disks, they didn't think ahead to projects using multiple, multi-megabyte, downloads per task. And if they didn't think about the problem, they wouldn't have bothered to code the solution. We've had a few of these legacy oversights retro-fitted over the years: it's worth raising an issue on GitHub. |
9) Message boards : Number crunching : Connection and Download issues Oct24
Message 71821 Posted 4 Nov 2024 by Richard Haselgrove |
On the 'credit' side of the situation: remember that CPDN awards credit incrementally, as each trickle is reported. Those interim credits would NOT be taken away if either you or the project aborted the task early - the only 'wasted' computation time would be that from the time of the final trickle to the time of the abort. You could wait until a trickle was reported, and abort the task immediately afterwards - wasting practically nothing. |
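As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes the task ran continuously, so that wall-clock time since the last reported trickle stands in for uncredited computation. The function name and the timestamps in the usage lines are hypothetical, not taken from any real task.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def uncredited_time(trickle_times: Iterable[datetime],
                    abort_time: datetime) -> Optional[timedelta]:
    """Time between the latest reported trickle and the abort.

    Returns None when no trickle was reported before the abort, because
    the trickle times alone then give no starting point.
    """
    reported = [t for t in trickle_times if t <= abort_time]
    if not reported:
        return None
    return abort_time - max(reported)


# Hypothetical timestamps: two trickles, then an abort five minutes later.
trickles = [datetime(2024, 11, 4, 10, 52), datetime(2024, 11, 4, 16, 52)]
print(uncredited_time(trickles, datetime(2024, 11, 4, 16, 57)))
```

Aborting just after a trickle keeps the returned interval close to zero, which is the point made in the post.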
10) Message boards : Number crunching : Connection and Download issues Oct24
Message 71808 Posted 3 Nov 2024 by Richard Haselgrove |
Uploads go direct to the climate researchers who commissioned the batch - in this case, in New Zealand. They don't follow the administrative route to Oxford. |
11) Message boards : Number crunching : Connection and Download issues Oct24
Message 71806 Posted 3 Nov 2024 by Richard Haselgrove |
Well, that was a nice quiet weekend. From what I can see, once the logjam was released late on Friday afternoon, everything has been running as it should. Uploads and trickle reports have been sent to their respective destinations, task pages show that credit awards have been made in real time, and the external aggregation sites have been able to collect their data packages as normal. Of course, the relatively few remaining tasks in this batch were scooped up very quickly, so we can't confirm just yet that every host that requests work can be serviced. But it's looking good. The Friday restart was the completion of the recovery process, with DNS and SSL returned to their status quo ante. But that leaves some space to consider the initial cause of the problems - the one which made it impossible to download fresh copies of the application files where needed. After looking through the logs, that seemed to me to be an attempt to deploy 'cloudflare' - a transparent caching program. This would actually be very useful to the project - it can save a huge amount of (paid-for) bandwidth when new applications are to be deployed. According to Glenn, "The next project to go out will be using the HadAM4 N216 application, linux only." - once final development tweaks to the application have been added and tested. So that's exactly the situation where cloudflare would be helpful. I would hope that the team will use this quiet break between batches to double-check the cloudflare manual and try again (and if they weren't planning to, I would suggest it!). But this time, please test it while things are resting, not in the heat of a batch release! |
12) Message boards : Number crunching : Connection and Download issues Oct24
Message 71789 Posted 1 Nov 2024 by Richard Haselgrove |
That must be a quirk of Firefox. Chrome gives me exactly the url I asked for. |
13) Message boards : Number crunching : Connection and Download issues Oct24
Message 71787 Posted 1 Nov 2024 by Richard Haselgrove |
If you use "cpdn.org" without www or main it will default to using www?
cpdn.org has its own DNS entry - that points to 162.159.140.127, which holds a copy of the main publicity page of climateprediction.net. Edit - climateprediction.net also points to 162.159.140.127, so it's an alias rather than a copy. |
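The "same address, so it's an alias" check can be sketched in Python by grouping hostnames by the address they resolve to. This is a minimal sketch: `group_by_address` is a hypothetical helper, and the lookup is passed in as a parameter so it can be tried without network access. The addresses in the stand-in table are the ones quoted in this thread and may have changed since.

```python
import socket
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_address(hostnames: Iterable[str],
                     resolve: Callable[[str], str] = socket.gethostbyname
                     ) -> Dict[str, List[str]]:
    """Map each resolved IPv4 address to the hostnames that point at it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in hostnames:
        groups[resolve(name)].append(name)
    return dict(groups)


# Stand-in lookup table using the addresses reported in the thread
# (an assumption about the DNS records at that time, not a live query).
recorded = {
    "cpdn.org": "162.159.140.127",
    "climateprediction.net": "162.159.140.127",
    "main.cpdn.org": "129.67.193.106",
}
print(group_by_address(recorded, resolve=recorded.__getitem__))
```

Dropping the `resolve` argument makes the function use the system resolver instead, and the results then depend on the DNS records at the time of the call.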
14) Message boards : Number crunching : Connection and Download issues Oct24
Message 71778 Posted 1 Nov 2024 by Richard Haselgrove |
Yes, I noticed. The problem is with the SSL certificate for www.cpdn.org which boinc needs, cpdn.org works ok for the website.
To be exact, we have two separate urls for the same server/filing system - the whole operational shooting match. They are:
www.cpdn.org
main.cpdn.org
As I type, both urls point to the same ip address (129.67.193.106) - that's DNS doing its thing. But main.cpdn.org is protected by an SSL certificate, www.cpdn.org isn't. If you use a browser to bypass the certificate check, you can see exactly the same data - but BOINC doesn't bypass the check. Provided the 'one server' identity was intentional (we haven't had that confirmed by Oxford), then an alt name on the certificate should resolve all the problems (including data collection by the external stats sites - the data dumps are up-to-date as of 17:50 on 31 October). |
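To show why an alt name would help, here is a minimal Python sketch of the name check a TLS client performs against a certificate's subject alternative names. The matching rules are simplified (exact match, or a single leading wildcard label), and the two name lists are hypothetical: the actual contents of the CPDN certificate are not given in this thread.

```python
from typing import List


def name_matches(hostname: str, pattern: str) -> bool:
    """Compare a hostname with one certificate DNS name (simplified rules)."""
    hostname, pattern = hostname.lower(), pattern.lower()
    if pattern.startswith("*."):
        # A leading wildcard stands in for exactly one label.
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return hostname == pattern


def covered(hostname: str, alt_names: List[str]) -> bool:
    """True if any subject alternative name covers the hostname."""
    return any(name_matches(hostname, name) for name in alt_names)


# Hypothetical certificate contents, for illustration only.
current = ["main.cpdn.org"]
proposed = ["main.cpdn.org", "www.cpdn.org"]

print(covered("www.cpdn.org", current))   # False
print(covered("www.cpdn.org", proposed))  # True
```

With only main.cpdn.org listed, a client that connects to www.cpdn.org reaches the same server but rejects the certificate. Adding the second name (or a wildcard) lets both URLs pass the check without any DNS change.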
15) Message boards : Number crunching : Connection and Download issues Oct24
Message 71775 Posted 1 Nov 2024 by Richard Haselgrove |
I notice that the number of tasks ready to send has been stuck for at least 2 refreshes of the page by the time of last update and the dropping of no of users in last 24 hours suggests no reporting of tasks is happening.
Everything to do with tasks has to go through the scheduler - fetching new work, reporting trickles, reporting completed tasks. So when the scheduler is inaccessible, everything stops. |
16) Message boards : Number crunching : Connection and Download issues Oct24
Message 71764 Posted 31 Oct 2024 by Richard Haselgrove |
I've asked Andy about this and he says this will settle down once the DNS changes made today to fix the problem take effect and trickle down to hosts.
I'm not sure I'm convinced that a DNS re-configuration would resolve an SSL certificate domain-name mismatch, but we can only wait and see. |
17) Message boards : Number crunching : Connection and Download issues Oct24
Message 71763 Posted 31 Oct 2024 by Richard Haselgrove |
It would appear that there are two problems:
You're right, but you're looking at the wrong part of the log. The trickle error is:
31/10/2024 18:31:53 | climateprediction.net | [sched_op] Starting scheduler request
The uploads are going through OK:
31/10/2024 17:07:01 | climateprediction.net | Started upload of wah2_nz25_11y8_209705_25_1028_012346455_0_r970510034_3.zip
The timings aren't comparable, because the trickles are given the scheduler backoff of around an hour, and then retried. Those were both from task 22523488: one of the new batch, picked up just before midnight last night. It's showing a reported trickle at 10:52 UTC today - there should have been another one around tea-time, but it got stuck. |
18) Message boards : Number crunching : Connection and Download issues Oct24
Message 71753 Posted 31 Oct 2024 by Richard Haselgrove |
The same client iteration in Win10 in a VM still says, "Internet servers may be temporarily down."
That sounds more like the message you get from the 'reference site' (google.com) when the BOINC client wants to check if a problem is project-specific or global. That can be kicked out of the way with
<dont_contact_ref_site>1</dont_contact_ref_site>
in cc_config.xml |
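For anyone unsure where that line goes: in cc_config.xml it sits inside the `<options>` section of the `<cc_config>` root. Here is a minimal Python sketch that adds it to an existing configuration, or builds a fresh one, without disturbing other settings. The helper name is hypothetical, and the sketch works on text only, so reading and writing the actual file in the BOINC data directory is left to the user.

```python
import xml.etree.ElementTree as ET


def set_dont_contact_ref_site(xml_text: str = "") -> str:
    """Return cc_config.xml text with <dont_contact_ref_site> set to 1."""
    # Start from the existing configuration if there is one.
    root = ET.fromstring(xml_text) if xml_text.strip() else ET.Element("cc_config")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_contact_ref_site")
    if flag is None:
        flag = ET.SubElement(options, "dont_contact_ref_site")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


# Building a fresh configuration from nothing:
print(set_dont_contact_ref_site())
```

The client only picks up the change after it re-reads its config files or is restarted.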
19) Message boards : Number crunching : Connection and Download issues Oct24
Message 71748 Posted 31 Oct 2024 by Richard Haselgrove |
Still getting
31/10/2024 14:16:52 | climateprediction.net | [http] [ID#7] Info: schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - The target principal name is incorrect.
on those non-permanent app downloads. I'll let it keep trying, and then try a fetch once I have the apps. |
20) Message boards : Number crunching : Connection and Download issues Oct24
Message 71746 Posted 31 Oct 2024 by Richard Haselgrove |
Now it's changed to:
31/10/2024 12:16:46 | climateprediction.net | [http] [ID#5] Info: Connected to www.cpdn.org (129.67.193.106) port 443
31/10/2024 12:16:46 | climateprediction.net | [http] [ID#5] Info: schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - The target principal name is incorrect.
31/10/2024 12:16:46 | climateprediction.net | [http] [ID#5] Info: schannel: shutting down SSL/TLS connection with www.cpdn.org port 443
31/10/2024 12:16:46 | climateprediction.net | [http] HTTP error: SSL peer certificate or SSH remote key was not OK
31/10/2024 12:16:46 | climateprediction.net | Temporarily failed download of wah2_8.24_windows_intelx86.exe: transient HTTP error |
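The log excerpt above uses a `timestamp | project | message` layout. Here is a minimal Python sketch for picking the certificate failures out of a longer log, assuming every line of interest follows that three-field layout. The function names and marker strings are illustrative choices drawn from this excerpt, not part of BOINC.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    project: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp | project | message' line; None if it does not fit."""
    parts = [p.strip() for p in line.split(" | ", 2)]
    if len(parts) != 3:
        return None
    return LogEntry(*parts)


# Substrings taken from the excerpt above; an assumption about how a
# certificate problem shows up, not an exhaustive list.
CERT_MARKERS = ("certificate check failed", "SSL peer certificate")


def certificate_errors(lines: Iterable[str]) -> List[LogEntry]:
    """Return the parsed entries whose message mentions a certificate problem."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and any(m in e.message for m in CERT_MARKERS)]
```

Run over the five lines quoted above, this keeps the schannel check failure and the "SSL peer certificate" HTTP error, and skips the connection, shutdown and download lines.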
©2025 cpdn.org