climateprediction.net (CPDN) home page
Thread 'The uploads are stuck'

AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,915,528
RAC: 15,795
Message 67742 - Posted: 15 Jan 2023, 12:12:55 UTC - in response to Message 67738.  

Does somebody know when full bandwidth will be available again?

Thursday at the earliest. That's based on info from a status update post by project people higher up this thread.

There's been talk about how to deal with upcoming deadlines. Glenn has been looking at options but may not have decided what to do yet. I highly doubt you'll lose your work even if the deadline passes.

Overall there's been progress uploading files for many users. Unfortunately, it does seem like there are a few users, like you, who are having difficulty making any real progress. It might be that, as a group, you will be the last ones to finally get rid of your files.

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,311,890
RAC: 633
Message 67744 - Posted: 15 Jan 2023, 13:40:17 UTC - in response to Message 67716.  
Last modified: 15 Jan 2023, 13:52:25 UTC

Richard Haselgrove wrote:
Richard Haselgrove wrote:
I've found a technique which seems to help. Go through this sequence:

  • Suspend network activity (BOINC Manager, Advanced view, Activity menu)
  • Retry all transfers (Tools Menu)
  • Allow network activity

That just cleared the last six tasks due by 25 January, on one machine.

And it's just worked on my second machine as well - tied off the loose ends from 14 tasks in a single hour.

14/01/2023 17:12:27 | climateprediction.net | Reporting 14 completed tasks
I usually wait until both connections have stalled, and the queue has gone into 'project backoff'. Not sure if that's a significant part of the procedure, but it can't spoil it.
I just tried it once on two computers, and it did not set anything in motion.


Jean-David Beyer wrote:
I am getting pretty good response from the upload server, though not as good as it was about 10 (?) days ago. I have high speed (75 megabit/second) fiber optic Internet connection, but I am in USA and the server is in England. So right now, traceroute does not make it all the way to the server. It did recently.

But notice the big delay from New York to London. Step 8 to step 9. This is usual and unchanged. IIRC, the server is at about step 22.
[...]
16  ral-r26.ja.net (146.97.41.34)  80.929 ms  83.553 ms  79.803 ms
wujj123456 wrote:
Thanks for the traceroute output. The server doesn't respond to ICMP packets, probably blocked for security. ral-r26.ja.net seems to be the last hop everyone sees from traceroute. Your latency is pretty low, compared to my 140-150ms to reach ral-r26.ja.net.
My roundtrip times are even lower:
14  ral-r26.ja.net (146.97.41.34)  34.969 ms  34.954 ms  34.178 ms
Yet I don't get anything uploaded anyway.
(To be precise, one computer of mine uploaded 86 files and the other computer 41 files, since Wednesday night.)

Boone
Joined: 8 Aug 05
Posts: 3
Credit: 13,689,587
RAC: 4,982
Message 67745 - Posted: 15 Jan 2023, 13:43:41 UTC - in response to Message 67742.  

Good day,
Is it knowledge or conjecture that only a minority are still waiting to upload?
I am posting here for the first time. I was hopeful too, but these statements no longer leave me unconcerned.
I have 85 GB waiting to upload. I once had a 2-hour window in which I could upload 30 GB, but that was last week.
If Thursday really is the earliest, then my WUs with expiration date 01/20/2023 would most likely be overdue by the time I can upload them.
Unless the server is actively prevented from resending my WUs as new tasks, I expect my work will not be honored. And I have not read anything here about such a plan being in place.
And yes, like Stony666, I have followed all the tips, reduced connections to 2 per project, and even restarted my router to get a new IP address to end this wait.
And fundamentally:
I am surprised that the project team and their infrastructure side underestimated the load from the OIFS tasks so badly. The number of waiting Linux computers meeting the requirements was known, as was how many WUs would be sent out, how big the WUs are, and how fast the data would come back.
In my view, all of this was known and could have been estimated beforehand. If things are so professional on the side of the project team and their infrastructure, how could this happen?
I was astonished that so many batches were released for calculation before Christmas, and I had respect for an infrastructure that was supposed to cope with this.
Well, was that too optimistic?

Please, please, let the (green) power and the money spent on it not have been spent in vain for WUs that are not evaluated.

Thank you

Translated with www.DeepL.com/Translator (free version)

[SG]Felix
Joined: 4 Oct 15
Posts: 34
Credit: 9,075,151
RAC: 374
Message 67746 - Posted: 15 Jan 2023, 14:37:36 UTC - in response to Message 67745.  

Please keep in mind that the CPDN team estimated that the server would keep up, based on the specs from JASMIN, which runs the datacenter. So the CPDN team cannot be blamed for this one.

I have just blocked all uploads from my host via the hosts file, because I can get a slot within 5 minutes at most. Hopefully some other hosts can take that slot and use it for older WUs.
As far as I can see on the server status page, many thousands of WUs have been processed and uploaded; both the number in progress and the number ready to send are decreasing.

Greets from Germany
Felix

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 67751 - Posted: 15 Jan 2023, 16:59:08 UTC
Last modified: 15 Jan 2023, 17:01:32 UTC

For the folks that say the problem is on my end, that's just not true. I've been doing this BOINC thing since the beginning. I know what I am doing, and all my machines are doing just fine on all the other projects. The problem is with the CPDN side, and has not been resolved to date. And I am not alone, as you can see with some of the other volunteers here in this thread. And yes, I did try the "trick" posted a few days ago. Like Richard, it made no difference. My uploads cannot upload. Not slowly, not occasionally, not at all.

FWIW, the current error (which has changed over time) is

23727	climateprediction.net	1/15/2023 8:44:31 AM	Temporarily failed upload of oifs_43r3_ps_0655_2007050100_123_976_12193299_0_r455876171_113.zip: transient HTTP error	
23728	climateprediction.net	1/15/2023 8:44:31 AM	Backing off 05:48:34 on upload of oifs_43r3_ps_0655_2007050100_123_976_12193299_0_r455876171_113.zip	
23729			1/15/2023 8:44:33 AM	Internet access OK - project servers may be temporarily down.	
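
For anyone who wants to know when a backed-off file is next due, here is a minimal Python sketch that reads the "Backing off HH:MM:SS on upload of ..." message shown above and adds the delay to the time the line was logged. The function name `parse_backoff` is made up for this example, and the message format is assumed from the three lines in this post only; other client versions may word it differently.

```python
import re
from datetime import datetime, timedelta

# Pattern taken from the log line quoted in this post (an assumption about the format).
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoff(message, logged_at):
    """Return (filename, earliest retry time), or None if this is not a back-off line."""
    match = BACKOFF_RE.search(message)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.group(1, 2, 3))
    delay = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return match.group(4), logged_at + delay


# Timestamp and message copied from the log above (local time of the poster).
logged_at = datetime(2023, 1, 15, 8, 44, 31)
message = (
    "Backing off 05:48:34 on upload of "
    "oifs_43r3_ps_0655_2007050100_123_976_12193299_0_r455876171_113.zip"
)
name, retry_at = parse_backoff(message, logged_at)
print(name, "retries no earlier than", retry_at)
```

A manual retry in the Transfers tab does not have to wait for that time.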

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,731,493
RAC: 6,912
Message 67756 - Posted: 15 Jan 2023, 17:36:47 UTC - in response to Message 67751.  

Which "trick" are you guys referring to? The one I first posted in message 67703, quoted by xii5ku?

That technique wasn't designed to, and won't help with, getting a connection in the first place. It was intended - assuming you're getting connections reasonably often - to dispose of the oldest uploads first. Those are the ones which got stuck the first time round, and were sent to the back of the queue. It helps to get the oldest tasks - the ones closest to deadline - safely reported back to base.
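
For headless machines, the three Manager steps from message 67703 (suspend network activity, retry all transfers, allow network activity) could probably be scripted with `boinccmd`. The Python sketch below is an assumption-laden mapping, not something tested in this thread: it assumes `--set_network_mode never` matches "Suspend network activity", `--network_available` matches the Manager's retry of pending transfers, and `--set_network_mode auto` matches allowing network activity based on preferences. It also assumes `boinccmd` is on the PATH and is run where it can read the client's RPC password. The helper names `retry_sequence` and `run_sequence` are made up.

```python
import subprocess


def retry_sequence(boinccmd="boinccmd"):
    """Build the three boinccmd calls; nothing is executed here."""
    return [
        [boinccmd, "--set_network_mode", "never"],  # assumed: suspend network activity
        [boinccmd, "--network_available"],          # assumed: retry deferred transfers
        [boinccmd, "--set_network_mode", "auto"],   # assumed: allow network activity again
    ]


def run_sequence(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)


# Print the commands for inspection instead of running them.
for argv in retry_sequence():
    print(" ".join(argv))
```

Calling `run_sequence(retry_sequence())` would actually issue the commands; as noted above, this only helps once the machine is getting connections reasonably often.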

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,915,528
RAC: 15,795
Message 67766 - Posted: 16 Jan 2023, 3:28:59 UTC

I finally got all of my files uploaded, 100%. That's one less user to compete for the limited amount of connection slots.

For those concerned about missing deadlines, Glenn is aware of the possibility, so I highly doubt anyone here will lose their work or not get credit.

I think someone has mentioned this before but limiting the number of uploads per project doesn't seem to make a difference. It might actually slow things down. I decided to try and temporarily increase mine and the total upload throughput seems to have increased. Kind of like running several tasks concurrently.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,731,493
RAC: 6,912
Message 67767 - Posted: 16 Jan 2023, 10:08:54 UTC - in response to Message 67766.  

I decided to go the other way. The bulk of my backlog has cleared (I'm well into the February deadlines now), so I've dropped down from 5 to 2 tasks running per machine, and from 2 to 1 connections. There are frequent pauses in uploading (when I hope others are getting a turn at the trough), but overall the direction of travel is positive.

I'll keep trying to push the uploads through, particularly the odd ones and twos that are preventing a whole task reporting, but when they're gone, I'll shut down for a short pause while I return all the BOINC settings to normal, and catch up on overdue maintenance. Then, back into the fray with a new batch.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67768 - Posted: 16 Jan 2023, 10:59:12 UTC - in response to Message 67767.  
Last modified: 16 Jan 2023, 14:25:57 UTC

I am down to three uploading now. I pushed a bunch through tethering my phone, earliest due is 4th Feb so have switched back to bored band.

Edit: 2 now and looking at server status page the sum of tasks in progress and tasks ready to send is dropping by about 100/hour.

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 67770 - Posted: 16 Jan 2023, 15:18:55 UTC

Update from CPDN meeting 16/1/23

Upload http connections will be increased over the next 2 days back to the previous maximum of 300 simultaneous connections as most of the previous uploaded data has now been moved off.

The 'grace period' for return of tasks has been changed from 'nothing' to 30 days. CPDN hope this and the increase in httpd connections will be enough for those remaining tasks to upload in time. Thanks to Richard for mentioning the grace period options.

Stony666
Joined: 9 Feb 21
Posts: 9
Credit: 10,689,509
RAC: 3,567
Message 67773 - Posted: 16 Jan 2023, 15:47:25 UTC - in response to Message 67770.  

Thanks for the grace period update!! :)

More than 100 WUs were uploaded. Hopefully the rest will be transferred within the next few days.

Regards Jörg

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,731,493
RAC: 6,912
Message 67774 - Posted: 16 Jan 2023, 15:52:28 UTC - in response to Message 67770.  

All sounds positive and sensible. I don't know if the first steps towards increasing the number of connections have been taken yet, but it seems to be running even more smoothly now, even compared with this morning.

There may still be one problem left: people who ran into the problem caused by the limit on uploads in progress. That can be very slow to recover from, if all tasks have been crunched and no new uploads are being generated. When the major rush has died down, it might be worth checking to see if there seem to be any hosts being very slow to clear tasks in progress. You could send out a notice to draw attention to the manual retry options - just a single click every hour or two can make a huge difference.

Yeti
Joined: 5 Aug 04
Posts: 178
Credit: 18,974,870
RAC: 38,708
Message 67775 - Posted: 16 Jan 2023, 16:40:10 UTC

In the last 4 hours I was able to upload my entire backlog, though I did change <max_file_xfers_per_project> from 2 to 5 in cc_config. From that moment on, the uploads went much better than before.
Supporting BOINC, a great concept !
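
For reference, the setting mentioned above lives inside the `<options>` section of `cc_config.xml`. The Python sketch below only builds and prints such a fragment with the value 5 from this post; the function name `build_cc_config` is made up, and it deliberately does not write to the BOINC data directory, since an existing `cc_config.xml` may hold other options that should be kept.

```python
import xml.etree.ElementTree as ET


def build_cc_config(per_project):
    """Return a minimal cc_config.xml document limiting transfers per project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    limit = ET.SubElement(options, "max_file_xfers_per_project")
    limit.text = str(per_project)
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config(5)
print(xml_text)
```

After editing the real file, the client has to re-read its config files (or be restarted) before the new limit applies.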

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 42,294,577
RAC: 73,464
Message 67776 - Posted: 16 Jan 2023, 18:03:35 UTC

Thanks for the update. I finally started getting decent progress in the last few hours and have uploaded more than a thousand trickle files so far. However, no WU is ready to report. I thought the boinc client should have prioritized the files closer to deadline? Even FIFO should have got the earlier WUs out by now. On one of my hosts, I have a bunch of files at 100% progress but with upload pending status that never retry. Has anyone seen this before? With 30 days of grace period, I should be able to drain all WUs, so hopefully at worst they would be forced to retry then...

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,731,493
RAC: 6,912
Message 67777 - Posted: 16 Jan 2023, 18:15:39 UTC - in response to Message 67776.  

However, no WU is ready to report. I thought boinc client should have prioritized the files closer to deadline?
BOINC prioritises tasks by deadlines, but doesn't do the same for individual files. When uploading, it doesn't bother to look at the deadline for the parent task. Another one for the 'lessons learned' wrap-up?

All is not lost. I posted a workaround in message 67703, which has worked for me several times over the weekend.

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,311,890
RAC: 633
Message 67778 - Posted: 16 Jan 2023, 18:19:24 UTC

Since this morning (UTC+1 time zone) my two hosts have been uploading continuously, with only a very small portion of transfers still getting stuck. But these remaining hiccups don't decrease my effective upload rate anymore, which is now limited by my own internet uplink bandwidth again.

_________________________________

Overall progress:

[Graph: oifs_43r3_ps ready-to-send (purple) and in-progress (yellow)]

Cumulative view of both:

[Graph: oifs_43r3_ps to-be-done, i.e. ready-to-send + in-progress (yellow), and ready-to-send (purple)]

Source: grafana.kiska.pw/d/boinc/boinc (made by Kiska)

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,551,831
RAC: 17,001
Message 67779 - Posted: 16 Jan 2023, 20:54:13 UTC - in response to Message 67776.  

Thanks for the update. I finally started getting decent progress in the last few hours and have uploaded more than a thousand trickle files so far. However, no WU is ready to report. I thought the boinc client should have prioritized the files closer to deadline? Even FIFO should have got the earlier WUs out by now.
Unfortunately, once the openifs task has handed over the upload file(s) to boinc client, you're at the mercy of client in terms of upload priority. It would be great if the client did prioritize older WUs first, shame it doesn't.

What I had to do more than a few times is manually 'retry now' the last remaining uploads for a WorkUnit to get it completely uploaded. You may already know but if the workunit name is, say, oifs_43r3_ps_0646_1986050100_123_955_12172290_1, the '12172290' is the WU id, look for that in the upload filenames and then keep hitting 'retry now' in boincmgr (or similar) to transfer them. The last file will be numbered _122.zip on the end (122 files in total per WU for these current batches). And do it with a cup of coffee (or a gin... :)
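
The filename matching described above can be sketched in a few lines of Python. This assumes, based only on the two names quoted in this thread, that the WU id is the second-to-last underscore-separated field of the task name and that upload files end in `_<number>.zip`. The helper names are made up, and two of the three filenames in the example are hypothetical (marked in the comments); the task name is inferred from the upload filename in the log posted earlier.

```python
def workunit_id(task_name):
    """'oifs_..._12172290_1' -> '12172290' (second-to-last field, assumed layout)."""
    return task_name.split("_")[-2]


def upload_number(filename):
    """'..._113.zip' -> 113 (trailing sequence number of the upload file)."""
    stem = filename.rsplit(".", 1)[0]
    return int(stem.rsplit("_", 1)[1])


def uploads_for_task(task_name, pending_files):
    """Pending upload files that belong to the task's workunit, in sequence order."""
    marker = "_" + workunit_id(task_name) + "_"
    matches = [name for name in pending_files if marker in name]
    return sorted(matches, key=upload_number)


task = "oifs_43r3_ps_0655_2007050100_123_976_12193299_0"  # inferred from the log
pending = [
    "oifs_43r3_ps_0655_2007050100_123_976_12193299_0_r455876171_122.zip",  # hypothetical
    "oifs_43r3_ps_0646_1986050100_123_955_12172290_1_r000000000_57.zip",   # hypothetical
    "oifs_43r3_ps_0655_2007050100_123_976_12193299_0_r455876171_113.zip",  # from the log
]
for name in uploads_for_task(task, pending):
    print(upload_number(name), name)
```

The files it lists are the ones to hit 'retry now' on for that WorkUnit.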

On one of my hosts, I have a bunch of files at 100% progress but with upload pending status and never retry. Has anyone seen this before? With 30 days of grace period, I should be able to drain all WUs, so hopefully at worst they would be forced to retry then...
Yes, I had that happen a lot when the upload server was accepting no more than 50 concurrent connections. I left them alone and they did eventually sort themselves out.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 67781 - Posted: 16 Jan 2023, 21:47:24 UTC

Unfortunately, once the openifs task has handed over the upload file(s) to boinc client, you're at the mercy of client in terms of upload priority. It would be great if the client did prioritize older WUs first, shame it doesn't.

I did post about this on the BOINC boards, suggesting that users should be able to prioritise uploads. Last time it was requested on GitHub it was rejected, but I think we have good reasons to lobby for it now.

zombie67 [MM]
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 67782 - Posted: 17 Jan 2023, 4:46:37 UTC

All my pending uploads started working today, and are all now complete.

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,915,528
RAC: 15,795
Message 67785 - Posted: 17 Jan 2023, 8:22:35 UTC - in response to Message 67781.  
Last modified: 17 Jan 2023, 8:23:38 UTC

It would be great if the client did prioritize older WUs first, shame it doesn't.

I'd propose that it does, relatively well. From what I've observed, uploads are done for tasks in due date order and files are uploaded in sequential order. If a given file gets stuck, it's backed off. Depending on the type of back-off, the file is retried sooner in some cases and later in others. I don't believe we've had issues with clients jumping around and uploading things in random order. Some users might have a stronger preference for tidy uploads, but if the urge is strong enough, the user can use the Retry Now button for individual files that got stuck and are backed off. Otherwise I'd say let BOINC do its thing.

... users should be able to prioritise uploads. Last time it was requested on GitHub it was rejected but I think we have good reasons to lobby for it now.

I'd argue against doing this or that there's even a need. It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far. Even the users who've had a hard time getting a connection slot are starting to report completed uploads. Even though CPDN put in a due date grace period I suspect it'll hardly be needed, which I believe was also Glenn's position in an earlier post.

What we should ask BOINC developers to do is get BOINC to check memory availability better before starting new tasks so it doesn't start too many which can lead to crashes (CPDN) or to the system being so bogged down as to become unusable (LHC ATLAS). Something Glenn and Richard have been talking about.