climateprediction.net (CPDN) home page
Thread 'The uploads are stuck'

SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 68025 - Posted: 25 Jan 2023, 0:29:49 UTC

shrug Ok. My point is mostly that a random desktop on a gigabit link could serve as a backup upload server while the core one is acting up. I don't know what the infrastructure looks like, I just know that this shiny new cloud provider, apparently selected to be an improvement on previous stuff, isn't living up to expectations, at the cost of an awful lot of compute.

And WCG can't feed tasks either right now (they assign them, but most of them don't download), and I'm not that interested in asteroids. I heat my office on compute with surplus power, and that's gotten really hit and miss lately. :(
ID: 68025 · Report as offensive     Reply Quote
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 68026 - Posted: 25 Jan 2023, 1:12:52 UTC - in response to Message 68025.  

And WCG can't feed tasks either right now (they assign them, but most of them don't download),


Just got a bucket full of WCG with some ARP w/u as well. They do download after a while with a bit of prodding.
ID: 68026 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68040 - Posted: 25 Jan 2023, 11:13:20 UTC

Glenn has posted in OpenIFS discussion

Update:
Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks).
ID: 68040 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 68041 - Posted: 25 Jan 2023, 11:21:50 UTC - in response to Message 68040.  

Glenn has posted in OpenIFS discussion
Update: Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks).
You beat me to it - I was just coming here! I don't know how many concurrent connections they will set initially for the upload server but they are aware tasks are timing out, they will increase it as soon as they can.
ID: 68041 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68042 - Posted: 25 Jan 2023, 11:39:03 UTC - in response to Message 68041.  
Last modified: 25 Jan 2023, 11:57:15 UTC

You beat me to it - I was just coming here! I don't know how many concurrent connections they will set initially for the upload server but they are aware tasks are timing out, they will increase it as soon as they can.


Earliest deadlines on my tasks are 19th Feb, so I may wait till things calm down a bit before adding to the fray.

Edit: Changed my mind because my uploads take so long and they are now working again. One task running and first 8 uploads have gotten through.
ID: 68042 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 68043 - Posted: 25 Jan 2023, 11:42:02 UTC - in response to Message 68025.  

shrug Ok. My point is mostly that a random desktop on a gigabit link could serve as a backup upload server while the core one is acting up. I don't know what the infrastructure looks like, I just know that this shiny new cloud provider, apparently selected to be an improvement on previous stuff, isn't living up to expectations, at the cost of an awful lot of compute.
That wouldn't be enough. Data was uploading at ~250 MB/s; a gigabit link manages only ~125 MB/s (at best). At that data rate, 1 TB fills up in ~1 hr. That's why cloud infrastructure is needed for these big projects. And this is one of the biggest they've done.
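A quick sanity check of those numbers, as a minimal sketch. It reads the upload rate as roughly 250 megabytes per second and uses decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes). Both are assumptions, and the function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted above.
# Assumed: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes)
# and an ingest rate of about 250 MB/s.


def link_megabytes_per_second(link_gigabits_per_second: float) -> float:
    """Theoretical ceiling of a network link in MB/s, ignoring protocol overhead."""
    bits_per_second = link_gigabits_per_second * 1e9
    return bits_per_second / 8 / 1e6


def minutes_to_fill(capacity_terabytes: float, rate_megabytes_per_second: float) -> float:
    """Minutes until storage of the given size is full at a constant ingest rate."""
    seconds = capacity_terabytes * 1e12 / (rate_megabytes_per_second * 1e6)
    return seconds / 60


if __name__ == "__main__":
    print(f"1 Gbit/s link ceiling: {link_megabytes_per_second(1):.0f} MB/s")
    print(f"1 TB at 250 MB/s fills in about {minutes_to_fill(1, 250):.0f} minutes")
```

Under those assumptions a single gigabit link tops out at about half the observed rate, and a terabyte of disk lasts a little over an hour.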
ID: 68043 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68044 - Posted: 25 Jan 2023, 11:58:22 UTC

My uploads are working again. 8 have gone through so far, which means, I would guess, that some have managed to get 800 or more through.
ID: 68044 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,732,321
RAC: 6,894
Message 68045 - Posted: 25 Jan 2023, 12:11:23 UTC

Mine have started too, from the machine I plan to upgrade tomorrow. But the congestion is apparent, with both connections stalled as I type, so it'll be a while before this is completely over.
ID: 68045 · Report as offensive     Reply Quote
computezrmle

Joined: 9 Mar 22
Posts: 30
Credit: 1,065,239
RAC: 556
Message 68055 - Posted: 26 Jan 2023, 7:27:40 UTC

Last night my oifs upload backlog cleared and all finished tasks were successfully reported.
ATM all tasks in progress can upload their trickles before the next one appears.

One guy less fighting for an upload slot.
ID: 68055 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,732,321
RAC: 6,894
Message 68056 - Posted: 26 Jan 2023, 7:48:24 UTC - in response to Message 68055.  

I allowed one machine to upload yesterday, and that has completely cleared itself overnight - four days of work in less than 24 hours.

So now I've allowed the second machine to take its turn on the pipe.
ID: 68056 · Report as offensive     Reply Quote
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 31,089,551
RAC: 14,948
Message 68058 - Posted: 26 Jan 2023, 8:57:01 UTC

All my waiting uploads cleared overnight. Have resumed computing.
ID: 68058 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 68060 - Posted: 26 Jan 2023, 10:52:21 UTC

Reminder to reset <ncpus> tag in cc_config.xml if you changed it

If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.
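One way to check a machine for a leftover override, as a minimal sketch. It assumes the usual cc_config.xml layout, with `<ncpus>` inside `<options>`. The helper names and the sample XML are hypothetical, and the file's location depends on how BOINC was installed.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_ncpus(cc_config_xml: str) -> Optional[int]:
    """Return the <ncpus> override found in cc_config.xml text, or None if it is not set."""
    root = ET.fromstring(cc_config_xml)
    node = root.find("./options/ncpus")
    if node is None or not (node.text or "").strip():
        return None
    return int(node.text.strip())


def needs_reset(cc_config_xml: str) -> bool:
    """True when <ncpus> is present and set to something other than -1."""
    value = read_ncpus(cc_config_xml)
    return value is not None and value != -1


# Hypothetical file contents with the workaround still in place.
EXAMPLE = """<cc_config>
  <options>
    <ncpus>100</ncpus>
  </options>
</cc_config>"""

if __name__ == "__main__":
    print("reset needed" if needs_reset(EXAMPLE) else "ok")
```

After editing the file, the client has to re-read its config files (or be restarted) before the change takes effect.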

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.

It would save CPDN trawling through their database to find these hosts and contact their owners.

Thanks!
ID: 68060 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68064 - Posted: 26 Jan 2023, 12:46:06 UTC - in response to Message 68060.  

It would save CPDN trawling through their database to find these hosts and contact their owners.
My guess is that there will be few if any who don't read the forums who have done it.
ID: 68064 · Report as offensive     Reply Quote
Yeti

Joined: 5 Aug 04
Posts: 178
Credit: 18,974,870
RAC: 38,708
Message 68065 - Posted: 26 Jan 2023, 13:06:13 UTC - in response to Message 68060.  

There are some more OpenIFS batches coming soon
Hopefully you will publish needs like RAM / WU, HD-Space / WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news?
Supporting BOINC, a great concept !
ID: 68065 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68066 - Posted: 26 Jan 2023, 13:52:10 UTC - in response to Message 68065.  

There are some more OpenIFS batches coming soon
Hopefully you will publish needs like RAM / WU, HD-Space / WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news?


Pretty certain computer requirements will be pretty much identical to the last several batches. I am hoping that any remaining issues with the cloud storage will be resolved by then, but if there are outages at the upload server again, disk space will become an issue.
ID: 68066 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68067 - Posted: 26 Jan 2023, 21:05:40 UTC - in response to Message 68066.  

I got a "new" task about four and a half hours ago today that seems to be running just fine.
Its "trickles" seem to go up instantly too -- in five seconds -- a few take six seconds.

The previous two clients failed with this one.

Thu 26 Jan 2023 03:51:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip

Thu 26 Jan 2023 03:51:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_39.zip

ID: 68067 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68074 - Posted: 27 Jan 2023, 6:33:44 UTC - in response to Message 68067.  
Last modified: 27 Jan 2023, 6:34:20 UTC

The previous two clients failed with this one..


I am always pleased when previous machines have failed on a work unit and my machine then completes it correctly.

Fri 27 Jan 2023 01:15:36 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_122.zip
Fri 27 Jan 2023 01:15:42 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0278_1988050100_123_957_12173922_2_r213940714_122.zip
Fri 27 Jan 2023 01:17:37 AM EST | climateprediction.net | Computation for task oifs_43r3_ps_0278_1988050100_123_957_12173922_2 finished


Task 22303962
Name                    oifs_43r3_ps_0278_1988050100_123_957_12173922_2
Workunit                12173922
Created                 26 Jan 2023, 16:25:01 UTC
Sent                    26 Jan 2023, 16:25:21 UTC
Report deadline         27 Mar 2023, 16:25:21 UTC
Received                27 Jan 2023, 6:18:07 UTC
Server state            Over
Outcome                 Success
Client state            Done
Exit status             0 (0x00000000)
Computer ID             1511241
Run time                13 hours 51 min 38 sec
CPU time                13 hours 34 min 12 sec
Validate state          Valid
Credit                  0.00
Device peak FLOPS       6.06 GFLOPS
Application version     OpenIFS 43r3 Perturbed Surface v1.05
                        x86_64-pc-linux-gnu
Peak working set size   4,610.14 MB
Peak swap size          4,974.14 MB
Peak disk usage         1,218.62 MB

ID: 68074 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 68084 - Posted: 27 Jan 2023, 15:48:08 UTC - in response to Message 68065.  

Yes, we'll give a heads-up on the batches. I noted that request a while ago. The RAM requirements stay the same with the current crop of apps. What will change is runtimes and model output.

The batch id is in the workunit name. If one of my workunits has a name:
oifs_43r3_ps_0135_2007050100_123_976_12192779_1

batch id is the '976'. I'm sure I've seen a forum post somewhere explaining all the parts of the workunit names.
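Pulling the batch id out of a name can be sketched as below. This assumes the fields keep the same positions counted from the right: the batch id is third from the end, and the workunit number is second from the end (which matches the task details posted earlier in this thread). The remaining fields are left uninterpreted, and the type and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class WorkunitName(NamedTuple):
    batch_id: str
    workunit_id: str
    leading_fields: Tuple[str, ...]  # everything before the batch id, left uninterpreted
    trailing_field: str              # last field, left uninterpreted


def parse_workunit_name(name: str) -> WorkunitName:
    """Split a workunit/task name on '_' and pick fields by position from the right."""
    fields = name.split("_")
    if len(fields) < 4:
        raise ValueError(f"unexpected workunit name: {name!r}")
    return WorkunitName(
        batch_id=fields[-3],
        workunit_id=fields[-2],
        leading_fields=tuple(fields[:-3]),
        trailing_field=fields[-1],
    )


if __name__ == "__main__":
    parsed = parse_workunit_name("oifs_43r3_ps_0135_2007050100_123_976_12192779_1")
    print("batch:", parsed.batch_id, "workunit:", parsed.workunit_id)
```

Counting from the right avoids depending on how many fields come before the batch id.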

There are some more OpenIFS batches coming soon
Hopefully you will publish needs like RAM / WU, HD-Space / WU before starting the batch(es), together with an identifier so we can recognize the different batches. Perhaps even as a sticky post or news?
ID: 68084 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68098 - Posted: 28 Jan 2023, 11:57:49 UTC - in response to Message 68084.  
Last modified: 28 Jan 2023, 11:59:25 UTC

batch id is the '976'. I'm sure I've seen a forum post somewhere explaining all the parts of the workunit names.
This is the post that tells you about the task names.
ID: 68098 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68099 - Posted: 28 Jan 2023, 16:40:25 UTC

Uploading trickles is now going very well. My pile-up of them all completed a day or two ago. Now they go through as fast as they are produced.
Note that this one went up in 4 seconds. Mostly, they are taking 5 or 6 seconds to go up. And I am not pressing Retry because they do not stay in the list long enough that I ever see them anymore. Which is just fine with me.
Sat 28 Jan 2023 11:25:18 AM EST | climateprediction.net | Sending scheduler request: To send trickle-up message.
Sat 28 Jan 2023 11:25:18 AM EST | climateprediction.net | Requesting new tasks for CPU
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Scheduler request completed: got 0 new tasks
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Project has no tasks available
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Project requested delay of 3636 seconds
Sat 28 Jan 2023 11:25:23 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
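
For anyone who wants to time uploads from the event log rather than by eye, here is a minimal sketch. It assumes the `timestamp | project | message` layout shown above, English day and month names, and a single time zone throughout the log (the zone abbreviation is simply dropped). The function names are hypothetical.

```python
from datetime import datetime
from typing import Dict

# Two lines copied from the log excerpt above.
LOG = """\
Sat 28 Jan 2023 11:25:19 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
Sat 28 Jan 2023 11:25:23 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0436_2013050100_123_982_12199080_2_r96501905_82.zip
"""

STARTED = "Started upload of "
FINISHED = "Finished upload of "


def parse_time(stamp: str) -> datetime:
    """Parse a log timestamp after dropping the trailing zone abbreviation (e.g. 'EST')."""
    without_zone = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, "%a %d %b %Y %I:%M:%S %p")


def upload_durations(log_text: str) -> Dict[str, float]:
    """Map each uploaded file name to the seconds between its Started and Finished lines."""
    started: Dict[str, datetime] = {}
    durations: Dict[str, float] = {}
    for line in log_text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # not a 'timestamp | project | message' line
        stamp, _project, message = parts
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = parse_time(stamp)
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                elapsed = parse_time(stamp) - started.pop(name)
                durations[name] = elapsed.total_seconds()
    return durations


if __name__ == "__main__":
    for name, seconds in upload_durations(LOG).items():
        print(f"{name}: {seconds:.0f} s")
```

On the two lines above this gives 4 seconds, matching the figure quoted in the post.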

ID: 68099 · Report as offensive     Reply Quote

©2024 cpdn.org