Message boards : Number crunching : The uploads are stuck
Joined: 14 Sep 08 Posts: 127 Credit: 42,006,146 RAC: 68,974
Do you want to continue crunching and generating more files, or just to be able to write the state files and wait for the upload server to recover? If it's the latter, you just need to free up enough space for the state file, which is only tens of MBs. You could …

Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.
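If you have to hunt for that space, something like this is a quick check; a sketch assuming a typical Linux install (the paths are illustrative, adjust for your own layout):

df -h
sudo du -xsh /var/* | sort -h

df shows which filesystem is actually full; du (with -x to stay on one filesystem) ranks the directories on it by size, so you can see what could be cleaned out or moved to make room.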
Joined: 27 Mar 21 Posts: 79 Credit: 78,306,920 RAC: 297
wujj123456 wrote:
There are completed WUs approaching their deadline soon. I have WUs due in 11 days, and 11 days no longer looks that long given the upload server has been down for two weeks. If the new storage array somehow has problems again, or some new issue shows up, which is not that uncommon for new systems, we probably need the server to extend deadlines so work isn't wasted.

After the reporting deadline, your work isn't obsolete right away. The server would create a replica task from the same workunit, would have to wait for a work-requesting host to assign this new task to, and then wait for that host to return a valid result. Until that happens, which can potentially be a long time after your reporting deadline, the server will still opportunistically accept a result from your original task. (And give credit for it if valid… normally. Not sure about CPDN, where credit is assigned separately.)

PS: if you return a valid result for the original task after the server has already assigned a replica task to another host, three things can follow:
1) The other host returns a result too. AFAIK it will get credit if valid.
2) The other host issues an unrelated scheduler request to the server. In the response, the server informs the other host that the replica is no longer needed.
2.a) If the host hasn't started the replica task yet, the task is aborted, and so no CPU cycles are wasted on it.
2.b) If the host has already started the replica task, same as 1: it will finish and report it, and AFAIK get credit if valid.
Joined: 14 Sep 08 Posts: 127 Credit: 42,006,146 RAC: 68,974
Thanks. That's in line with what I've observed on other projects. When I said wasted work, I mostly meant the unnecessary replicas being sent out, especially given that only the upload server is down. It could end up being a lot of duplicates if the upload server is not restored before many WUs time out. Thanks to Glenn's constant updates, though, I'm hopeful we won't reach that point.
Joined: 7 Jun 17 Posts: 23 Credit: 44,434,789 RAC: 2,600,991
wujj123456 wrote:
Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.

Indeed, that's what I've done. The loss of the state file has caused problems: presumably the .old state file was accessed as the client downloaded some hadam files; it also couldn't locate some of the oifs files, and so 20 or so were abandoned as errors, with the loss of 20 results.

My next move is to split the /boinc-client folder: I'm thinking of leaving the boinc-client directory on the /var/lib partition but mounting the /projects folder on a separate partition. At the moment the whole of the boinc-client folder is on a separate partition. The split arrangement would have meant that the state file could still have been written, much like mounting /var/log separately from /var. Something like the sketch below is what I have in mind. Any thoughts?
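A minimal sketch of the /etc/fstab entry, assuming a Debian-style /var/lib/boinc-client data directory and a hypothetical spare partition /dev/sdb1 (device name, path, and filesystem are illustrative, not from an actual setup):

# illustrative only: put the projects subdirectory on its own partition,
# leaving client_state.xml on the /var/lib filesystem
/dev/sdb1  /var/lib/boinc-client/projects  ext4  defaults  0  2

(Copy the existing projects directory onto the new partition before mounting over it, or the old files will be hidden.) With that split, a full projects partition would stop new result files, but the client could still write client_state.xml on /var/lib.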
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
wujj123456 wrote:
Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.

Under your account's computing preferences, you can set "Leave at least x GB free" (of disk space) to make sure there is enough left for uploads, etc.
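The same limit can also be set locally, and the local file takes precedence over the website preferences. A minimal sketch of a global_prefs_override.xml in the BOINC data directory, assuming a stock client (the 4.0 is just an example value):

<global_preferences>
   <disk_min_free_gb>4.0</disk_min_free_gb>
</global_preferences>

Then have the client re-read it (Options > Read local prefs file in BOINC Manager, if I remember the menu right) or restart the client. That way BOINC stops using disk before the filesystem is actually full, leaving headroom for the state file.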
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Upload server update 9/1/23 10:49GMT

Thanks for the update, Glenn. FYI... I set my "max uploads per project" to 1 in cc_config.xml, which is what I recommend for everyone.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
FYI... I set my "max uploads per project" to 1 in cc_config.xml, which is what I recommend for everyone.

Why? What is it supposed to do? I see no "max uploads per project" in cc_config.xml. Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

<max_event_log_lines>5000</max_event_log_lines>
<max_file_xfers>8</max_file_xfers>
<max_file_xfers_per_project>2</max_file_xfers_per_project>
<max_stderr_file_size>0.000000</max_stderr_file_size>
<max_stdout_file_size>0.000000</max_stdout_file_size>
<max_tasks_reported>0</max_tasks_reported>
Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054
If your Internet connection can do more than one, why not do it?

Because the project's server probably only has one internet connection, too. We don't know what type, how fast, how configured, but we all have to share it. And it's going to be very, very busy. Spread the love, eh?
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

Yes, that's what I mean:

<max_file_xfers>4</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>

For normal HTTPS traffic, yes, you want about 4 connections per server, and most browsers open 4 to 8 connections at a time anyway, because most big websites are server farms (multiple servers that can all work in parallel). However, file transfers are a different beast, and BOINC projects in particular are, as most are grant funded (i.e., run on minimal hardware). Your 1 allowed file transfer will still download or upload at the maximum possible speed, limited by the project's internet connection. It does no good to hammer the same project file server with multiple connections if connection #2 runs at half speed, connection #3 at 1/3 speed, etc. In other words, it won't take longer for YOU, but it will help the project server by only needing to serve 1 connection per client x 1000 active users, etc.
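For anyone who hasn't used it before, here's a minimal sketch of a complete cc_config.xml carrying just those two options. It goes in the BOINC data directory, and the client picks it up after Options > Read config files in BOINC Manager, or a client restart:

<cc_config>
  <options>
    <max_file_xfers>4</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>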
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

For CPDN I have my xfers_per_project set to 10 for each of my machines, because I know my fibre can handle it and so can the CPDN upload server (when it's working). Their upload server is on a big UK cloud server that was handling 10s of 1000s of uploads/hr for the OpenIFS tasks. The Weather@Home ones are more of an issue because they go to a server in New Zealand, which I understand doesn't have quite the same capacity.
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Upload server status: 10/Jan 16:00GMT

Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today, but they will run checks first before opening up the upload server. I'll get an update tomorrow.
Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left. The VM with oIFS still has about 14 GB left.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.

Luckily, I do not run VMs, so I am not getting nervous right now. My machine has a 512 GB SSD for the root, home, boot, and swap partitions, and I have two 4 TB spinning hard drives. Since I did not want to run out of space, envisioning large OpenIFS job requirements, I made a 512 GB partition on one of the hard drives and mounted it on /var/lib/boinc, where my distro's version of the BOINC client puts its files. CPDN is currently using 55 GB of this, there are 364 GB free for BOINC, and there are another 64 GB I could give to BOINC if it is ever needed.
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.

So just shut the tasks down until you've got space free. It's not your problem to solve that the upload server is down. I've got a few machines halted on "Too many uploads in progress" (yes, I know how to fix it, I just don't see a point right now), and a few others are running out of disk because I put cheap 128 GB M.2 SSDs in my compute rigs. "Designing for upload servers being down for weeks with huge tasks" was not a design criterion I considered, and it will remain one I won't consider given the relative rarity of the problem. If the machines are full due to things out of my control, they're full. And if contracts aren't met because a lot of machines are unable to compute because they can't return results, similarly, not my problem.

It'll still take me a week+ to upload my pending results, unless I can get some good overnight bandwidth out of Starlink (of course, that doesn't solve that a bunch of the machines are solar powered and don't run overnight). I've got hundreds of gigabytes to upload, and I simply can't do that quickly. Rough numbers below.
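To put purely illustrative numbers on it (neither figure is exact): 300 GB pending at a sustained 10 Mbit/s of upstream, i.e. about 1.25 MB/s, works out to

300,000 MB / 1.25 MB/s = 240,000 s, or roughly 2.8 days

of continuous uploading. Add retries, a recovering server, and solar-limited operating hours, and a week+ is easily plausible.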
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Update 22:30, 10/Jan

Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again.

Edit: I'm seeing 'no route to host' errors. Maybe something at the upload server needs re-enabling. Anyway, I'm told the data has been successfully migrated and the upload server has been enabled. Anything amiss can be dealt with quickly come office hours tomorrow, I would think.
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420
Yes, I am still seeing "connect(): failed" messages on all upload tries. But I still have 4 work units running and I am nowhere near filling up any disks, so no problem here.

Conan
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Might be a hung process holding on to port 80. They'll have to stop the service, kill any orphaned processes, and restart the service. In the meantime, you can check from this side whether the listener is back, as sketched below.
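A quick probe of the upload handler tells you more than the client's terse "connect(): failed". The hostname here is only a placeholder (take the real upload URL from one of the stuck transfers in your event log); the path is the usual BOINC one, but check yours:

curl -sS -I --max-time 10 http://upload.example.org/cgi-bin/file_upload_handler | head -n 1

A timeout or "connection refused" means the server end is still down; any HTTP status line at all means the listener is accepting connections again.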
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
My uploads have now started. At 100 KB/second it will be a while!
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
My uploads have now started. At 100 KB/second it will be a while!

I've just had confirmation from CPDN that the upload server is now fully functional.
Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054
Mine have started too. Reasonable speeds (for my line) of around 1,000 KB/sec once they've latched on, but the occasional file pauses and needs to retry. Expected, for this stage of a recovery. And I've been able to report several tasks, with just a few loose ends to tidy up.