Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
... I was sooo happy the CPDN had an abundance of jobs and joined the party - only to then find out I can't get rid of my results. Me too. Locally it was even funnier: a severe cold snap hit just when my fastest, hottest CPUs ran out of work, so I had to burn a lot of methane to keep my house warm. Murphy's law. And it happened over the winter holidays, when tech support for low-budget volunteer project infrastructure is minimal or overpriced. It's funny, ironic, anti-serendipitous, and another example of the famous Murphy's Law. Keep on crunching, people. Patience pays. Thanks to all. E |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I had turned off new tasks a few days ago, and they all ran out around lunch time (my local time) today. I waited a bit, and resumed crunching by allowing new tasks. I got three and they started running. But all the accumulated uploads failed to upload (no surprise), so I am now adding to the list. CPDN is now using about 38 GBytes of disk. Luckily I have about 380 GBytes of disk space still available for BOINC. But even that will run out sometime. We were told not to expect the problem to be fixed on Monday (Boxing Day in England), but maybe after 9AM on Tuesday. It is now after 9AM on Wednesday and still no uploads. Has anyone a clue what the problem is and when they expect it to be fixed? Sigh! 8-(
Wed 28 Dec 2022 01:49:48 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip
Wed 28 Dec 2022 01:49:50 PM EST | | Internet access OK - project servers may be temporarily down.
Wed 28 Dec 2022 01:50:15 PM EST | | Project communication failed: attempting access to reference site
Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip: transient HTTP error
Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Backing off 00:02:31 on upload of oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip
Wed 28 Dec 2022 01:50:17 PM EST | | Internet access OK - project servers may be temporarily down.
Wed 28 Dec 2022 01:51:49 PM EST | | Project communication failed: attempting access to reference site
Wed 28 Dec 2022 01:51:49 PM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip: transient HTTP error
Wed 28 Dec 2022 01:51:49 PM EST | climateprediction.net | Backing off 00:03:36 on upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip
Wed 28 Dec 2022 01:51:50 PM EST | | Internet access OK - project servers may be temporarily down. |
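For anyone who wants to see at a glance which files are stuck, a log excerpt like the one above can be summarised with a short script. This is only a sketch: the `timestamp | project | message` layout and the two message wordings are inferred from this excerpt alone, and `summarize_uploads`, `FAILED`, and `BACKOFF` are illustrative names, not part of BOINC.

```python
import re
from collections import Counter

# Patterns inferred only from the excerpt above (an assumption); other
# BOINC messages or client versions may be worded differently.
FAILED = re.compile(r"Temporarily failed upload of (\S+?): (.+)$")
BACKOFF = re.compile(r"Backing off (\d+):(\d+):(\d+) on upload of (\S+)$")

# Three lines copied from the post above.
SAMPLE = [
    "Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Temporarily failed upload of "
    "oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip: transient HTTP error",
    "Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Backing off 00:02:31 on upload of "
    "oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip",
    "Wed 28 Dec 2022 01:50:17 PM EST | | Internet access OK - project servers may be temporarily down.",
]


def summarize_uploads(lines):
    """Return (failure count per file, latest backoff in seconds per file)."""
    failures = Counter()
    backoff_seconds = {}
    for line in lines:
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # not in the assumed three-field layout
        message = parts[2]
        failed = FAILED.search(message)
        if failed:
            failures[failed.group(1)] += 1
            continue
        backoff = BACKOFF.search(message)
        if backoff:
            hours, minutes, seconds, name = backoff.groups()
            backoff_seconds[name] = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return failures, backoff_seconds


if __name__ == "__main__":
    failures, backoffs = summarize_uploads(SAMPLE)
    for name, count in failures.items():
        print(f"{name}: {count} failed attempt(s), backing off {backoffs.get(name)} s")
```

Feeding it the client's full event log instead of `SAMPLE` should give a per-file count of failed attempts, provided the log uses the same wording as the lines quoted here.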
Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422 |
Copied from "uploads stuck". From Andy: Hi Dave, Thanks. I have looked at this. This machine keeps losing its SSH port and HTTP port. I reset it and it keeps losing them again. I am going to look at this further tomorrow. Best wishes, Andy. And an update to this: I have made a request to the JASMIN cloud service, where this machine resides, to look into this. |
Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
Any updates I get regarding stuck uploads will go here rather than in the thread of that name, as it has gone off on a tangent and I can't be bothered to move all the offending posts. Beginning to look like JASMIN support are either not working 24/7 or have not been able to work out what the issue is. Edit: Support will be provided during normal working hours, defined as between 0900 and 1700 on Monday to Thursday and between 0900 and 1630 on Friday, excluding Public Holidays and STFC Privilege Days. Note that times are given in UK time. So if not sorted by 16:00 UTC tomorrow we will almost certainly have to wait till at least Tuesday, with Monday being a public holiday in the UK. |
Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
So if not sorted by 16:00 UTC tomorrow we will almost certainly have to wait till at least Tuesday, with Monday being a public holiday in the UK. From Andy,
|
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Beginning to look like JASMIN support are either not working 24/7 or have not been able to work out what the issue is. Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well for the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support. Now maybe CPDN is not one of their important clients dealing with critical information (banking, law enforcement, medical facilities, and G.O.K. what else). Can they afford to have no technical support for over a week? |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well for the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support. Wasn't that when we were still running small test batches? Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service? |
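The back-of-the-envelope check suggested above can be sketched in a few lines. The task and file counts are the figures quoted in the post; the 22 TB total is the poster's estimate, decimal units (1 TB = 10^12 bytes) are an assumption, and the function names are purely illustrative.

```python
# Figures quoted in the post above.
TASKS_IN_PROGRESS = 12_326
FILES_TO_UPLOAD = 1_516_098
TOTAL_BYTES = 22e12  # "around 22 terabytes"; decimal TB is an assumption


def files_per_task(files: int, tasks: int) -> float:
    """Average number of upload files produced by one task."""
    return files / tasks


def mean_file_size_mb(total_bytes: float, files: int) -> float:
    """Average size of one upload file in decimal megabytes."""
    return total_bytes / files / 1e6


if __name__ == "__main__":
    per_task = files_per_task(FILES_TO_UPLOAD, TASKS_IN_PROGRESS)
    size_mb = mean_file_size_mb(TOTAL_BYTES, FILES_TO_UPLOAD)
    print(f"{per_task:.0f} files per task, about {size_mb:.1f} MB per file")
```

On these numbers each task accounts for 123 upload files averaging roughly 14.5 MB each, which is the per-file load the upload server would have to absorb.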
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
I might be wrong but I think the test batches go direct to CPDN and not via JASMIN. Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well for the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support. Wasn't that when we were still running small test batches? Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service? The support CPDN get from JASMIN will depend on their service contract. But it's laughable that JASMIN pressured CPDN to get off the older unmanaged cloud server because of support issues (which delayed the release of these batches), only for CPDN to then get stuffed by lack of support when the server goes down on the new cloud. Still, a backup upload server (or two) might have helped. I'm not familiar with the BOINC side but it looks like there is only one in place. The next CPDN technical meeting will be interesting. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
I might be wrong but I think the test batches go direct to CPDN and not via JASMIN. That idea came about because we tried tracert to the upload server's IP address when this error first came about, and the last routing hop that responded had a .ja.net suffix. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service? I do not know about anyone else, but my one Linux machine has about 3200 CPDN .zip files to upload. That is about 28 tasks of output. It has been up for a little over two weeks. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,807,823 RAC: 19,824 |
I might be wrong but I think the test batches go direct to CPDN and not via JASMIN. ... It sounds like Richard maybe was referring to the test runs on the main site, not the dev. site. Richard, I do think that those questions are very valid, whether enough attention to detail was paid. I definitely hope that the current issue is not due to the new server not being up to par to handle the amount of uploads (which I assume will only increase with higher resolution models). |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Richard, I do think that those questions are very valid, whether enough attention to detail was paid. I definitely hope that the current issue is not due to the new server not being up to par to handle the amount of uploads (which I assume will only increase with higher resolution models). You have to remember there is only 1 full-time paid IT person at CPDN, Andy, who does a great job but is usually juggling 10 things at once. Andy is actually very good at detail (far better than my hacking about...), but there was a lot of time pressure because of contract commitments from the Perturbed Surface project. There's a bit of a back story. JASMIN (the cloud provider) wanted CPDN to move to their newer managed server before the end of the year, which, because of other commitments, was not done until the test batches were complete. That made it rather a rush. I suspect it's a software issue that maybe got missed, but IMHO it's also crap timing, and rather poor from JASMIN that they can't provide any support between Christmas and New Year. The new server (when it works) has a much improved capacity, so that's not the issue. Knowing the Prof. in charge of CPDN I suspect strong words will be sent to JASMIN.... I'm more worried the results won't be available in time as the PS contract finishes end of Feb. |
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I'm more worried the results won't be available in time as the PS contract finishes end of Feb. The upload time for existing tasks shouldn't be a server network capacity issue - 22 TB over a 1 Gbit/s link is only about 2 days. I expect it will take far longer for a lot of users to upload their caches, though... Is it possible to get exemptions to the "tasks per day" ramp? It seems like it takes a long time to get a new machine "up to capacity," even if it's returning valid results. Some of my compute boxes weren't able to get to full capacity for a while, though... at this point, some of them are idle for lack of upload slots or something (max uploads in progress). These tasks look like they'd be a good fit for preemptible compute instances on GCE or some other cloud platform if one wanted to throw a bunch of cores at them, but that's wasted if the machines can't stay busy. |
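The "about 2 days" figure above is easy to reproduce. This is a rough sketch that assumes decimal units (22 TB = 22 x 10^12 bytes, 1 Gbit/s = 10^9 bits/s); the `efficiency` parameter is a hypothetical knob for protocol overhead and link sharing, not a measured value.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move total_bytes over a link of the given raw speed.

    efficiency is the assumed fraction of the raw link rate actually
    achieved (1.0 means a perfectly saturated link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = total_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    ideal = transfer_days(22e12, 1e9)
    halved = transfer_days(22e12, 1e9, efficiency=0.5)
    print(f"saturated link: {ideal:.1f} days, at 50% efficiency: {halved:.1f} days")
```

At a fully saturated link this gives roughly 2.0 days, and about 4.1 days if only half the raw rate is achieved, so the order of magnitude holds even under pessimistic assumptions about the server side.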
Joined: 23 Nov 19 Posts: 4 Credit: 6,597,088 RAC: 79,816 |
Is there any update on this topic? I still have about 120 GB of results clogging my drives and keep getting transient HTTP errors. Would love to help crunch, but if this is doing nothing but blocking space on my disks, what's the point? |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,698,338 RAC: 10,100 |
Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then. |
Joined: 4 Dec 15 Posts: 52 Credit: 2,476,194 RAC: 1,633 |
It might be a good idea to extend the deadlines on the server side. - - - - - - - - - - Greetings, Jens |
Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then. I am hoping all will be resolved by the end of play today, or at least a significant dent made in the number of tasks needing to be uploaded. Over 300 went through in the short time the gate was open yesterday; I haven't checked exactly how many. |
Joined: 23 Nov 19 Posts: 4 Credit: 6,597,088 RAC: 79,816 |
Thanks! Fingers crossed then. I promise to open my gates for new tasks once the upload queue here is gone. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Thanks! One of my machines has started downloading again, so that's a good sign the others will be soon (as long as the server stays up). |
Joined: 23 Nov 19 Posts: 4 Credit: 6,597,088 RAC: 79,816 |
Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then. I am hoping all will be resolved by the end of play today, or at least a significant dent made in the number of tasks needing to be uploaded. Over 300 went through in the short time the gate was open yesterday; I haven't checked exactly how many. One of my boxes completed all uploads; the second, however, is now stuck with 2 (sic) out of those thousands of result files. Is/will the "gate" be opened permanently? After all, it seems to be the sole purpose of an upload server to be open for uploads? |