Message boards : Number crunching : The uploads are stuck
Author | Message |
---|---|
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
shrug Ok. My point is mostly that a random desktop on a gigabit link could serve as a backup upload server while the core one is acting up. I don't know what the infrastructure looks like, I just know that this shiny new cloud provider, apparently selected to be an improvement on previous stuff, isn't living up to expectations, at the cost of an awful lot of compute. And WCG can't feed tasks either right now (they assign them, but most of them don't download), and I'm not that interested in asteroids. I heat my office on compute with surplus power, and that's gotten really hit and miss lately. :( |
Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
"And WCG can't feed tasks either right now (they assign them, but most of them don't download)," Just got a bucket full of WCG with some ARP w/u as well. They do download after a while with a bit of prodding. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Glenn has posted in OpenIFS discussion. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001 |
"Glenn has posted in OpenIFS discussion" You beat me to it - I was just coming here! I don't know how many concurrent connections they will set initially for the upload server, but they are aware tasks are timing out and will increase the limit as soon as they can. Update: Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks). |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"You beat me to it - I was just coming here! I don't know how many concurrent connections they will set initially for the upload server but they are aware tasks are timing out, they will increase it as soon as they can." Earliest deadlines on my tasks are 19th Feb, so I may wait till things calm down a bit before adding to the fray. Edit: Changed my mind because my uploads take so long and they are now working again. One task running and the first 8 uploads have gotten through. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001 |
"shrug Ok. My point is mostly that a random desktop on a gigabit link could serve as a backup upload server while the core one is acting up. I don't know what the infrastructure looks like, I just know that this shiny new cloud provider, apparently selected to be an improvement on previous stuff, isn't living up to expectations, at the cost of an awful lot of compute." That wouldn't be enough. Data was uploading at ~250 MB/s; a gigabit link is only ~125 MB/s (at best). At that data rate, 1 TB is full in ~1 hr. That's why cloud infrastructure is needed for these big projects. And this is one of the biggest they've done. |
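For anyone who wants to check the arithmetic in the post above, here is a rough sketch. It assumes the ~250 figure means megabytes per second and that decimal (SI) units are meant; both are my reading of the post, not something the project has stated, and the function names are made up for illustration.

```python
# Back-of-envelope check of the upload figures quoted in the thread.
# Assumptions (not confirmed by the project): the ~250 figure is in
# megabytes per second, and decimal (SI) units apply (1 TB = 1,000,000 MB).

GIGABIT_LINK_BITS_PER_S = 1_000_000_000


def link_capacity_mbytes_per_s(bits_per_s: float) -> float:
    """Convert a raw line rate in bits/s to megabytes/s (8 bits per byte)."""
    return bits_per_s / 8 / 1_000_000


def hours_to_fill(capacity_tb: float, rate_mbytes_per_s: float) -> float:
    """Hours needed to fill capacity_tb terabytes at a sustained rate."""
    capacity_mbytes = capacity_tb * 1_000_000
    return capacity_mbytes / rate_mbytes_per_s / 3600


link = link_capacity_mbytes_per_s(GIGABIT_LINK_BITS_PER_S)
fill = hours_to_fill(1.0, 250.0)

print(f"Gigabit link, theoretical best: {link:.0f} MB/s")
print(f"Time to fill 1 TB at 250 MB/s: {fill:.2f} hours")
```

On those assumptions a single gigabit link tops out at half the observed inflow, before protocol overhead, which is the point being made.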
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
My uploads are working again. 8 have gone through so far, which means, I would guess, that some have managed to get 800 or more through. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
Mine have started too, from the machine I plan to upgrade tomorrow. But the congestion is apparent, with both connections stalled as I type, so it'll be a while before this is completely over. |
Send message Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556 |
Last night my oifs upload backlog could be cleared and all finished tasks could successfully be reported. ATM all tasks in progress can upload their trickles before the next one appears. One guy less fighting for an upload slot. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,732,321 RAC: 6,894 |
I allowed one machine to upload yesterday, and that has completely cleared itself overnight - four days of work in less than 24 hours. So now I've allowed the second machine to take its turn on the pipe. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,089,551 RAC: 14,948 |
All my waiting uploads cleared overnight. Have resumed computing. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001 |
Reminder to reset <ncpus> tag in cc_config.xml if you changed it. If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>. There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863. It would save CPDN trawling through their database to find these hosts and contact their owners. Thanks! |
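If you are not sure whether a machine still has the override in place, a small check along these lines might help. It is only a sketch: the sample XML assumes the usual cc_config.xml layout with <ncpus> inside <options>, the function name is hypothetical, and treating 0 or anything below it as "no override" is my assumption beyond the -1 mentioned in the post. Paste in the contents of your own file in place of the samples.

```python
import xml.etree.ElementTree as ET

# Sample file contents for illustration only; the layout (<ncpus> inside
# <options>) is assumed to match a typical cc_config.xml.
SAMPLE_OVERRIDDEN = """
<cc_config>
  <options>
    <ncpus>100</ncpus>
  </options>
</cc_config>
"""

SAMPLE_DEFAULT = """
<cc_config>
  <options>
    <ncpus>-1</ncpus>
  </options>
</cc_config>
"""


def ncpus_override(cc_config_xml: str):
    """Return the forced CPU count, or None if no override is in effect.

    Values of 0 or below are treated as "use the real CPU count"
    (an assumption; the thread only mentions -1).
    """
    root = ET.fromstring(cc_config_xml)
    node = root.find("options/ncpus")
    if node is None or node.text is None:
        return None
    value = int(node.text.strip())
    return value if value > 0 else None


for label, text in (("overridden", SAMPLE_OVERRIDDEN), ("default", SAMPLE_DEFAULT)):
    forced = ncpus_override(text)
    if forced is None:
        print(f"{label}: no <ncpus> override, nothing to change")
    else:
        print(f"{label}: <ncpus> is forced to {forced}, set it back to -1")
```

After editing the file, the client needs to re-read its config files (or be restarted) for the change to take effect.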
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"It would save CPDN trawling through their database to find these hosts and contact their owners." My guess is that there will be few, if any, who don't read the forums who have done it. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,974,870 RAC: 38,708 |
"There are some more OpenIFS batches coming soon" Hopefully you will publish requirements such as RAM per WU and disk space per WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news item? Supporting BOINC, a great concept! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"There are some more OpenIFS batches coming soon" "Hopefully you will publish requirements such as RAM per WU and disk space per WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news item?" Pretty certain computer requirements will be pretty much identical to the last several batches. I am hoping that any remaining issues with the cloud storage will be resolved by then, but if there are outages at the upload server again, disk space will become an issue. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I got a "new" task about four and a half hours ago today that seems to be running just fine. Its "trickles" seem to go up instantly too: most in five seconds, a few in six. The previous two clients failed with this one. Thu 26 Jan 2023 03:51:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip Thu 26 Jan 2023 03:51:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The previous two clients failed with this one." I am always pleased when previous machines have failed on a work unit and my machine then completes it correctly. Fri 27 Jan 2023 01:15:36 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_122.zip Fri 27 Jan 2023 01:15:42 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_122.zip Fri 27 Jan 2023 01:17:37 AM EST | climateprediction.net | Computation for task oifs_43r3_ps_0278_1988050100_123_957_12173922_2 finished Task details: Task: 22303962; Name: oifs_43r3_ps_0278_1988050100_123_957_12173922_2; Workunit: 12173922; Created: 26 Jan 2023, 16:25:01 UTC; Sent: 26 Jan 2023, 16:25:21 UTC; Report deadline: 27 Mar 2023, 16:25:21 UTC; Received: 27 Jan 2023, 6:18:07 UTC; Server state: Over; Outcome: Success; Client state: Done; Exit status: 0 (0x00000000); Computer ID: 1511241; Run time: 13 hours 51 min 38 sec; CPU time: 13 hours 34 min 12 sec; Validate state: Valid; Credit: 0.00; Device peak FLOPS: 6.06 GFLOPS; Application version: OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu; Peak working set size: 4,610.14 MB; Peak swap size: 4,974.14 MB; Peak disk usage: 1,218.62 MB |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,551,831 RAC: 17,001 |
"There are some more OpenIFS batches coming soon" "Hopefully you will publish requirements such as RAM per WU and disk space per WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news item?" Yes, we'll give a heads-up on the batches. I noted that request a while ago. The RAM requirements stay the same with the current crop of apps. What will change is runtimes and model output. The batch id is in the workunit name. If one of my workunits has the name oifs_43r3_ps_0135_2007050100_123_976_12192779_1, the batch id is the '976'. I'm sure I've seen a forum post somewhere explaining all the parts of the workunit names. |
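As a quick illustration of picking the batch id out of a name, something like the following should work for names shaped like the ones in this thread. It is a sketch, not an official parser: the function name is made up, it assumes exactly nine underscore-separated fields starting with "oifs", and it only labels the batch id (per the post above) and the workunit id (which matches the Workunit number shown on the task page posted earlier). The other fields are left uninterpreted.

```python
def parse_oifs_task_name(name: str) -> dict:
    """Extract the batch id and workunit id from an OpenIFS task name.

    Assumes nine underscore-separated fields, as in the examples posted
    in this thread; other name layouts are rejected rather than guessed at.
    """
    fields = name.split("_")
    if len(fields) != 9 or fields[0] != "oifs":
        raise ValueError(f"unexpected task name layout: {name!r}")
    return {
        "batch_id": fields[6],     # e.g. '976' in the example from the post
        "workunit_id": fields[7],  # matches the Workunit number on the task page
    }


for task in (
    "oifs_43r3_ps_0135_2007050100_123_976_12192779_1",
    "oifs_43r3_ps_0278_1988050100_123_957_12173922_2",
):
    info = parse_oifs_task_name(task)
    print(f"{task}: batch {info['batch_id']}, workunit {info['workunit_id']}")
```

Note that upload file names carry extra suffixes (an `_r...` part and a trickle number before `.zip`), so they would need trimming before being passed to this function.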
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"batch id is the '976'. I'm sure I've seen a forum post somewhere explaining all the parts of the workunit names." This is the post that tells you about the task names. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Uploading trickles is now going very well. My pile-up of them all completed a day or two ago. Now they go through as fast as they are produced. Note that this one went up in 4 seconds. Mostly, they are taking 5 or 6 seconds to go up. And I am not pressing Retry because they do not stay in the list long enough that I ever see them anymore. Which is just fine with me. Sat 28 Jan 2023 11:25:18 AM EST | climateprediction.net | Sending scheduler request: To send trickle-up message. Sat 28 Jan 2023 11:25:18 AM EST | climateprediction.net | Requesting new tasks for CPU Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Scheduler request completed: got 0 new tasks Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Project has no tasks available Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Project requested delay of 3636 seconds Sat 28 Jan 2023 11:25:23 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip |
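If anyone wants to measure upload times from the event log rather than eyeball them, a rough sketch follows. It assumes log lines in the "timestamp | project | message" layout pasted in this thread, with English month names and a trailing timezone token that is simply dropped; the function name is hypothetical and the sample lines are copied from the posts above.

```python
from datetime import datetime

# Sample lines copied from the event log excerpts posted in this thread.
LOG = """\
Thu 26 Jan 2023 03:51:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip
Thu 26 Jan 2023 03:51:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
Sat 28 Jan 2023 11:25:23 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
"""

STARTED = "Started upload of "
FINISHED = "Finished upload of "


def upload_durations(log_text: str) -> dict:
    """Map each uploaded file name to its upload time in seconds."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        stamp, _project, message = parts
        # Drop the weekday (first token) and the timezone (last token).
        tokens = stamp.split()
        when = datetime.strptime(" ".join(tokens[1:-1]), "%d %b %Y %I:%M:%S %p")
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = when
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                durations[name] = (when - started.pop(name)).total_seconds()
    return durations


for name, seconds in upload_durations(LOG).items():
    print(f"{seconds:.0f} s  {name}")
```

Because the timezone is ignored, an upload that straddles a clock change would be miscounted; for trickles that take a few seconds that should rarely matter.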
©2024 cpdn.org