Message boards : Number crunching : The uploads are stuck
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,974,870 RAC: 38,708 |
Here we go again: same here, we seem to upload faster than the internal processes move files to other places. Occasional uploads go through. Supporting BOINC, a great concept ! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have messaged Andy. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
Curious what's your $ per WU. I've also recently checked EC2, GCP and Azure, and they all have that nice catch of bandwidth cost. Their bandwidth costs around $0.08-0.10 per GB, which would mean around $0.15-$0.20 per WU. That alone already exceeds the cost per WU for whatever I can get with my own equipment, electricity and home network. Azure covers the first 100 GB, and the others' free allowances are negligible.

I've got a dual core EPYC VM running with 10GB RAM at $11.63/mo. It's running about 20h per task, with two going at any given time: https://www.cpdn.org/results.php?hostid=1538282 So, ballpark 70 WU/month, or $0.17/WU in compute costs, plus, as you note, bandwidth. Probably $0.25/WU. It's certainly more expensive than I manage at home, but I'm also upload-bandwidth limited here, so I can't actually run all my systems. It's just something I like to mess with every now and then. |
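The ballpark arithmetic in the post above can be written out explicitly. A minimal sketch using the poster's own figures; the per-WU upload volume (`gb_per_wu`) is my assumption, not a number from the thread:

```python
# Rough cost-per-WU arithmetic for a cloud VM running CPDN OpenIFS tasks.
# VM price, task duration and bandwidth price are from the post above;
# gb_per_wu is an assumed upload volume per task.

hours_per_month = 30 * 24          # ~720 h
vm_cost = 11.63                    # $/month for the dual-core VM
hours_per_task = 20
concurrent_tasks = 2

tasks_per_month = hours_per_month / hours_per_task * concurrent_tasks
compute_per_wu = vm_cost / tasks_per_month

bandwidth_price = 0.09             # $/GB, mid-range of the quoted $0.08-0.10
gb_per_wu = 1.0                    # ASSUMPTION: upload volume per task
total_per_wu = compute_per_wu + bandwidth_price * gb_per_wu

print(f"{tasks_per_month:.0f} WU/month, ${compute_per_wu:.2f} compute + "
      f"${bandwidth_price * gb_per_wu:.2f} bandwidth = ${total_per_wu:.2f}/WU")
```

With these inputs it lands on roughly 72 WU/month at about $0.16 compute per WU, consistent with the "~70 WU/month, $0.17/WU plus bandwidth, probably $0.25/WU" estimate in the post.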
Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,300,375 RAC: 73,419 |
I've got a dual core EPYC VM running with 10GB RAM at $11.63/mo. It's running about 20h per task, with two going at any given time: https://www.cpdn.org/results.php?hostid=1538282

Thanks. That's close to the number I came up with; all three generally fall into a similar $0.4-$0.5/WU range with their cheapest instance types. They're all pretty competitive with each other, but far more expensive than my own setup. I guess that should be expected, given that their machines are loaded with all the other cool stuff I don't need, plus better networking, uptime, etc., and they still need to make money. My upload link isn't great either, but it's enough for now. Hopefully the next versions of OpenIFS will have a higher compute-to-bandwidth ratio to make this easier. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
From Andy, Hi Dave, |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
Saw Andy's message, timed at 21:55 last night. It doesn't seem to have made much difference - I'm still in multi-hour project backoffs. I'll check exactly how many uploads are getting through when I've woken up a bit more.

On another subject, the generic Climate Prediction home page https://www.climateprediction.net/index.php is giving me an error today: "Your PHP installation appears to be missing the MySQL extension which is required by WordPress." PHP and WordPress are server-side technologies, so I think it's their installation, rather than my installation. Probably an update went wrong. The cpdn.org/cpdnboinc/ pages are working fine.

Edit - one machine got a 5-minute burst of uploads around 22:45 last night, and another around 04:45 this morning, but nothing since then. (A total of 100 files across the two bursts) |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have just seen one zip go through for me in the past hour. And I get the same as you on the WordPress thing. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,311,890 RAC: 633 |
Richard Haselgrove wrote:
Saw Andy's message, timed at 21:55 last night. It doesn't seem to have made much difference - I'm still in multi-hour project backoffs. I'll check exactly how many uploads are getting through when I've woken up a bit more. [...] Edit - one machine got a 5-minute burst of uploads around 22:45 last night, and another around 04:45 this morning, but nothing since then. (A total of 100 files across the two bursts)

The situation has turned from a certain portion of transfers failing (and going into retry loops) to a large portion of connection requests being rejected. Ever since the upload server was revived, it has evidently been working near or at its throughput limit, and only the details of how it is coping vary slightly over time. From the project's infrastructure point of view there is one good aspect to this: the upload infrastructure is well utilized (as long as it doesn't go down as it did on Christmas Eve, and during the first recovery attempt, or attempts, in early January). For us client operators it is of course bad, because we have to constantly watch and possibly re-adjust the compute clients to prevent overly large transfer backlogs, or even outright task failures from lack of disk space. The client can deal with a situation like this somewhat on its own, but not particularly well.

SolarSyonyk wrote:
I've got a dual core EPYC VM running with 10GB RAM at $11.63/mo. It's running about 20h per task, with two going at any given time: https://www.cpdn.org/results.php?hostid=1538282

So either you will be lucky and upload server availability recovers soon enough. Or you will need to jump through hoops to add storage to the VM while it is up and running. Or you will have to suspend the unstarted tasks, wait for the running tasks to complete, and then shut the VM down. Or you could shut down the VM right away and risk the tasks erroring out after resumption. Or you could suspend the VM, at the extra charge of the provider storing your VM state. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,915,528 RAC: 15,795 |
So I just noticed a problem that I thought wasn't going to happen. Tasks are timing out (due to upload issues) and are being resent to new users. The 30 day grace period setting doesn't seem to be working, or at least not in the way I'd expect it to. Richard, for example, you have a bunch of tasks like that. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
So I just noticed a problem that I thought wasn't going to happen. Tasks are timing out (due to upload issues) and are being resent to new users. The 30 day grace period setting doesn't seem to be working, or at least not in the way I'd expect it to. Richard, for example, you have a bunch of tasks like that.

Yes, I'm aware of those - they're the result of an educational experiment I carried out for Glenn, which went wrong for unexpected reasons. Those tasks are lost, and wouldn't have been returned even if there had been a grace period - I was going to suggest they should be resent immediately, so I'm glad to see that's happened. But interestingly, the resent tasks have been given a two-month deadline. It looks like the project considered the 'grace period' route, but decided to handle the problem in the traditional way instead. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,915,528 RAC: 15,795 |
they're the result of an educational experiment I carried out for Glenn, which went wrong for unexpected reasons

It seems to be more than that, as I got a couple of tasks as resends from two different users who have a bunch of timed-out tasks. I did notice the 2 month deadline on newly downloaded tasks. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
it seems to be more than that as I got a couple of tasks as resends from 2 different users who have a bunch of timed out tasks. I did notice the 2 month deadline on newly downloaded tasks.

Ah - had another look. I've downloaded a number of tasks since the new upload failure struck yesterday afternoon, and all of them (resends and initial _0 replications) show:
So it IS a grace period, but the BOINC server must only apply it to newly issued tasks after the configuration change - not to tasks already 'in the field'. |
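For context on where such a grace period lives server-side: as I understand it, BOINC projects set it project-wide in the project's config.xml. The option name below is my recollection of BOINC's project-options documentation, and the value is purely illustrative; treat both as assumptions:

```xml
<!-- Sketch of a BOINC project config.xml fragment (value hypothetical).
     Results arriving up to this many hours after their deadline are still
     accepted, and the scheduler delays creating resends accordingly. -->
<config>
  <grace_period_hours>720</grace_period_hours>  <!-- 30 days -->
</config>
```

If the scheduler only consults this when issuing new results, that would be consistent with the behaviour observed above: new tasks get the grace period, tasks already 'in the field' do not.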
Send message Joined: 1 Nov 04 Posts: 185 Credit: 4,166,063 RAC: 857 |
I got this with the last .zip for one WU:

So 22 Jan 2023 12:27:41 CET | climateprediction.net | Started upload of oifs_43r3_ps_0943_2008050100_123_977_12194587_0_r748317941_122.zip
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: 17 bytes stray data read before trying h2 connection
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: Hostname upload11.cpdn.org was found in DNS cache
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: Trying 192.171.169.187:80...
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: TCP_NODELAY set
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: connect to 192.171.169.187 port 80 failed: Connection refused
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: Failed to connect to upload11.cpdn.org port 80: Connection refused
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] [ID#8542] Info: Closing connection 4249
So 22 Jan 2023 12:27:42 CET | climateprediction.net | [http] HTTP error: Couldn't connect to server
So 22 Jan 2023 12:27:42 CET | | Project communication failed: attempting access to reference site
So 22 Jan 2023 12:27:42 CET | | [http] HTTP_OP::init_get(): https://www.google.com/
So 22 Jan 2023 12:27:42 CET | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0943_2008050100_123_977_12194587_0_r748317941_122.zip: connect() failed
So 22 Jan 2023 12:27:42 CET | climateprediction.net | Backing off 04:01:24 on upload of oifs_43r3_ps_0943_2008050100_123_977_12194587_0_r748317941_122.zip
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Info: Found bundle for host www.google.com: 0x55c2db0a5570 [can multiplex]
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Info: Re-using existing connection! (#4229) with host www.google.com
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Info: Connected to www.google.com (142.250.181.196) port 443 (#4229)
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Info: Using Stream ID: 5 (easy handle 0x55c2db407f10)
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: GET / HTTP/2
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: Host: www.google.com
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: user-agent: BOINC client (x86_64-pc-linux-gnu 7.20.5)
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: accept: */*
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: accept-encoding: deflate, gzip, br
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server: accept-language: de_DE
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Sent header to server:
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: HTTP/2 200
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: date: Sun, 22 Jan 2023 11:27:43 GMT
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: expires: -1
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: cache-control: private, max-age=0
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: content-type: text/html; charset=ISO-8859-1
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: cross-origin-opener-policy-report-only: same-origin-allow-popups; report-to="gws"
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: report-to: {"group":"gws","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/gws/other"}]}
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: content-encoding: gzip
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: server: gws
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: content-length: 6646
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: x-xss-protection: 0
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: x-frame-options: SAMEORIGIN
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: set-cookie: SOCS=CAAaBgiA-bGeBg; expires=Wed, 21-Feb-2024 11:27:43 GMT; path=/; domain=.google.com; Secure; SameSite=lax
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: set-cookie: AEC=ARSKqsJ2tJfYwIj7BRqYZCzCjnL9TNokU0KMa-wLLyh3nOCeOM4qvp504A; expires=Fri, 21-Jul-2023 11:27:43 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: set-cookie: __Secure-ENID=9.SE=EIkGuGkKLNZ5WKFdwpMeNVaclpvMIgOG-OaWak1yb9XtPFrIIK4_dt8Vp8tD7ooRuf7gyd5_8ydovUxE4FRoo40fM6BrbZZVjBD9t5nxmgoPH8vz98e04Z0EDsd-l37wrYtYCugw3LLFNaWKIDO4SAcE5mTGgJ4MYA5Tblb_s1A; expires=Thu, 22-Feb-2024 03:46:01 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: set-cookie: CONSENT=PENDING+503; expires=Tue, 21-Jan-2025 11:27:43 GMT; path=/; domain=.google.com; Secure
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server: alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Received header from server:
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 2513 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 3141 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 3320 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 3471 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 1420 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 218 bytes
So 22 Jan 2023 12:27:43 CET | | [http_xfer] [ID#0] HTTP: wrote 931 bytes
So 22 Jan 2023 12:27:43 CET | | [http] [ID#0] Info: Connection #4229 to host www.google.com left intact
So 22 Jan 2023 12:27:43 CET | | Internet access OK - project servers may be temporarily down.

While at the same time 3 new WUs got downloaded without any problem.

Greetings from Sänger |
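The client behaviour visible in that log (try the upload host; on failure, probe a reference site to distinguish "project down" from "no internet") can be imitated in a few lines. This is a rough sketch, not BOINC's actual code; the host names and ports are taken from the log above:

```python
# Minimal imitation of BOINC's reachability check seen in the log:
# try the upload server, then fall back to a reference site.
import socket

def can_connect(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ECONNREFUSED, timeouts, DNS failures, ...
        return False

if can_connect("upload11.cpdn.org", 80):
    print("Upload server reachable.")
elif can_connect("www.google.com", 443):
    print("Internet access OK - project servers may be temporarily down.")
else:
    print("No internet access.")
```

A refused connection (as in the log's "Connection refused") raises `ConnectionRefusedError`, a subclass of `OSError`, so it takes the same branch as a timeout or DNS failure.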
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
Mine went through a cycle:

22/01/2023 09:37:47 | climateprediction.net | Finished upload of oifs_43r3_ps_0030_2007050100_123_976_12192674_1_r1655111617_31.zip
22/01/2023 09:37:48 | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0431_2016050100_123_985_12202075_0_r1887730273_18.zip: transient HTTP error
22/01/2023 09:44:33 | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0041_1996050100_123_965_12181685_0_r1930143653_13.zip: connect() failed

There was another batch of uploads from 09:33 to 09:37, at normal speed. Then, a group of HTTP errors after long timeouts - I think we interpret that as an upload server crash. Finally, a series of almost-instantaneous connect failures, which are continuing as I type. It does seem that the upload server doesn't anticipate its local disks filling up very well, and then takes an extraordinarily long time to pass on the surplus and free enough space for normal service to resume. And then it fills up again very quickly. |
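The "Backing off 04:01:24" messages seen in these logs come from the client's randomized exponential backoff on failed transfers. Here is a sketch of the general scheme; the constants are assumptions chosen to match the roughly four-hour maximum observed in this thread, not values taken from the BOINC source:

```python
# Sketch of randomized exponential backoff for failed uploads.
# MIN_DELAY and MAX_DELAY are assumed constants, not BOINC's actual ones.
import random

MIN_DELAY = 60          # 1 minute
MAX_DELAY = 4 * 3600    # 4 hours, matching the ~04:01:24 backoffs seen

def backoff_seconds(failures: int) -> float:
    """Delay before the next retry after `failures` consecutive failures."""
    cap = min(MAX_DELAY, MIN_DELAY * 2 ** failures)
    return random.uniform(MIN_DELAY, cap)

for n in range(1, 10):
    print(f"after {n} failures: up to {min(MAX_DELAY, MIN_DELAY * 2 ** n)} s")
```

The randomization matters here: with hundreds of clients retrying against one overloaded server, jittered delays spread the reconnection attempts out instead of producing synchronized bursts.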
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
It does seem that the upload server doesn't anticipate the local disks filling up very well, and then takes an extraordinarily long time to pass on the surplus, and free enough space to allow normal service to be resumed. And then, it fills up again very quickly.

Only one has gotten through for me this morning since I first looked, about 07:00 UTC. I think it is going to be a long wait again. |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
So either you will be lucky and the upload server availability recovers soon enough. Or you will need to go through hoops to add storage to the VM while it is up and running. Or you will have to suspend the unstarted tasks and wait for the running tasks to complete and then shut the VM down. Or you could shut down the VM right away and risk the tasks erroring out after resumption. Or you could suspend the VM at extra charge for the provider's storing your VM state.

I just waited for the tasks to finish and shut the VM down. No point in paying to process units that may or may not get where they need to go. I'm doing the same for my onsite boxes: just going to let them finish, then wait until uploads clear out before resuming. It's becoming more hassle than it's worth to try to work around upload failures, so I may point CPU cycles at something that can take uploads, or I may just shut the machines down for a while. It's really hard to dig out from a backlog - I only barely have the bandwidth to keep up with production, and I had machines running but with tasks suspended for quite a few days to try to get WUs up before deadlines. If the infrastructure isn't stable, I'm done with heroics to work around it. I'll simply wait until stuff works before bothering again. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I am confused. I used to go to climateprediction.net to get here, and yesterday evening that failed. I could not get anywhere at all. I had to change it to cpdn.org to get here today. Could that be why I cannot upload anything?

Checking if climateprediction.net is down or it is just you... It's not just you! climateprediction.net is down.

Sun 22 Jan 2023 11:29:25 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0012_2009050100_123_978_12194656_0_r313555412_87.zip
Sun 22 Jan 2023 11:29:28 AM EST | | Project communication failed: attempting access to reference site
Sun 22 Jan 2023 11:29:28 AM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0012_2009050100_123_978_12194656_0_r313555412_87.zip: connect() failed
Sun 22 Jan 2023 11:29:28 AM EST | climateprediction.net | Backing off 00:02:48 on upload of oifs_43r3_ps_0012_2009050100_123_978_12194656_0_r313555412_87.zip
Sun 22 Jan 2023 11:29:30 AM EST | | Internet access OK - project servers may be temporarily down. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,311,890 RAC: 633 |
Saenger wrote: I got this with the last .zip for one WU: [...] Jean-David Beyer wrote: I am confused.There are (at least) four physically different servers:
(Actually, it is related to the CPDN BOINC functions insofar as the BOINC project URL is also www.climateprediction.net. I suppose it is impossible to attach new clients to CPDN for as long as this web server is down.)
Expect this sort of unavailability to happen again and again until the current OpenIFS work is done. (Unless CPDN can afford a storage subsystem with orders of magnitude more temporary space, or can set up an orders-of-magnitude faster outbound data link from the upload file handler to backing store.)
|
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,974,870 RAC: 38,708 |
Hm, guys, I think at the moment CPDN has more crunching power than the infrastructure can handle. So I think it is better for the project if I pause CPDN crunching for quite a while, until the infrastructure can handle the load. For now, I'll let all my clients finish their already-downloaded tasks, but not download any new ones. Supporting BOINC, a great concept ! |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,915,528 RAC: 15,795 |
Hm, guys, I think at the moment CPDN has more crunching power than the infrastructure can handle.

Funny you say that, as there's been talk of trying to increase the user base by making a VBox app. It sure seems like the project is finding out the hard way that introducing a new model type (OIFS) is not easy and isn't a small undertaking. The biggest problem currently is the upload situation. There's also the credit issue, which started before the upload issue, and more recently a RAC issue. The main website is down too. I hope we can at least finish out this contract and get everything processed and uploaded by the end of February. |
©2024 cpdn.org