Message boards : Number crunching : The uploads are stuck
Joined: 14 Sep 08 Posts: 127 Credit: 42,006,146 RAC: 68,974
Do you want to continue crunching and generating more files, or just to be able to write the state files and wait for the upload server to recover? If it's the latter, you just need to free up enough space for the state file, which is only tens of MBs. You could …

Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.
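If you have to hunt for that space, something like this is a quick check; a sketch assuming a typical Linux install (the paths are illustrative, adjust for your own layout):

df -h
sudo du -xsh /var/* | sort -h

df shows which filesystem is actually full; du (with -x to stay on one filesystem) ranks the directories on it by size, so you can see what could be cleaned out or moved to make room.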
Joined: 27 Mar 21 Posts: 79 Credit: 78,306,920 RAC: 297
wujj123456 wrote:
There are completed WUs approaching their deadline soon. I have WUs due in 11 days, and 11 days no longer looks that long given the upload server has been down for two weeks. If the new storage array somehow has problems again, or some new issue shows up, which is not that uncommon for new systems, we probably need the server to extend deadlines so work isn't wasted.

After the reporting deadline, your work isn't obsolete right away. The server would create a replica task from the same workunit, would have to wait for a work-requesting host to assign this new task to, and then wait for that host to return a valid result. Until that happens, which can potentially be a long time after your reporting deadline, the server will still opportunistically accept a result from your original task. (And give credit for it if valid… normally. Not sure about CPDN, where credit is assigned separately.)

PS: if you return a valid result for the original task after the server has already assigned a replica task to another host, three things can follow:
1) The other host returns a result too. AFAIK it will get credit if valid.
2) The other host issues an unrelated scheduler request to the server. In the response, the server informs the other host that the replica is no longer needed.
2.a) If the host hasn't started the replica task yet, the task is aborted, and so no CPU cycles are wasted on it.
2.b) If the host has already started the replica task, same as 1: it will finish and report it, and AFAIK get credit if valid.
Joined: 14 Sep 08 Posts: 127 Credit: 42,006,146 RAC: 68,974
Thanks. That's in line with what I've observed on other projects. When I said wasted work, I mostly meant the unnecessary replicas being sent out, especially given that only the upload server is down. It could end up being a lot of duplicates if the upload server is not restored before many WUs time out. Thanks to Glenn's constant updates, though, I'm hopeful we won't reach that point.
Joined: 7 Jun 17 Posts: 23 Credit: 44,434,789 RAC: 2,600,991
wujj123456 wrote:
Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.

Indeed, that's what I've done. The loss of the state file has caused problems: presumably the .old state file was accessed as the client downloaded some hadam files; it also couldn't locate some of the oifs files, and so 20 or so were abandoned as errors, with the loss of 20 results.

My next move is to split the /boinc-client folder: I'm thinking of leaving the boinc-client directory on the /var/lib partition but mounting the /projects folder on a separate partition. At the moment the whole of the boinc-client folder is on a separate partition. The split arrangement would have meant that the state file could still have been written, much like mounting /var/log separately from /var. Something like the sketch below is what I have in mind. Any thoughts?
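A minimal sketch of the /etc/fstab entry, assuming a Debian-style /var/lib/boinc-client data directory and a hypothetical spare partition /dev/sdb1 (device name, path, and filesystem are illustrative, not from an actual setup):

# illustrative only: put the projects subdirectory on its own partition,
# leaving client_state.xml on the /var/lib filesystem
/dev/sdb1  /var/lib/boinc-client/projects  ext4  defaults  0  2

(Copy the existing projects directory onto the new partition before mounting over it, or the old files will be hidden.) With that split, a full projects partition would stop new result files, but the client could still write client_state.xml on /var/lib.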
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
wujj123456 wrote:
Edit: Just realized that if you can't write the state file, any messing around within BOINC might be hopeless. So you have to find the space elsewhere on the system.

Under your account's computing preferences, you can set "Leave at least x GB free" (of disk space) to make sure there is enough left for uploads, etc.
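The same limit can also be set locally, and the local file takes precedence over the website preferences. A minimal sketch of a global_prefs_override.xml in the BOINC data directory, assuming a stock client (the 4.0 is just an example value):

<global_preferences>
   <disk_min_free_gb>4.0</disk_min_free_gb>
</global_preferences>

Then have the client re-read it (Options > Read local prefs file in BOINC Manager, if I remember the menu right) or restart the client. That way BOINC stops using disk before the filesystem is actually full, leaving headroom for the state file.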
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Upload server update 9/1/23 10:49GMT

Thanks for the update, Glenn. FYI... I set my "max uploads per project" to 1 in cc_config.xml, which is what I recommend for everyone.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
FYI... I set my "max uploads per project" to 1 in cc_config.xml, which is what I recommend for everyone.

Why? What is it supposed to do? I see no "max uploads per project" in cc_config.xml. Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

<max_event_log_lines>5000</max_event_log_lines>
<max_file_xfers>8</max_file_xfers>
<max_file_xfers_per_project>2</max_file_xfers_per_project>
<max_stderr_file_size>0.000000</max_stderr_file_size>
<max_stdout_file_size>0.000000</max_stdout_file_size>
<max_tasks_reported>0</max_tasks_reported>
Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054
If your Internet connection can do more than one, why not do it?

Because the project's server probably only has one internet connection, too. We don't know what type, how fast, how configured, but we all have to share it. And it's going to be very, very busy. Spread the love, eh?
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

Yes, that's what I mean:

<max_file_xfers>4</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>

For normal HTTPS traffic, yes, you want about 4 connections per server, and most browsers open 4 to 8 connections at a time anyway, because most big websites are server farms (multiple servers that can all work in parallel). However, file transfers are a different beast, and BOINC projects in particular are, as most are grant funded (i.e., run on minimal hardware). Your 1 allowed file transfer will still download or upload at the maximum possible speed, limited by the project's internet connection. It does no good to hammer the same project file server with multiple connections if connection #2 runs at half speed, connection #3 at 1/3 speed, etc. In other words, it won't take longer for YOU, but it will help the project server by only needing to serve 1 connection per client x 1000 active users, etc.
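For anyone who hasn't used it before, here's a minimal sketch of a complete cc_config.xml carrying just those two options. It goes in the BOINC data directory, and the client picks it up after Options > Read config files in BOINC Manager, or a client restart:

<cc_config>
  <options>
    <max_file_xfers>4</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>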
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?

For CPDN I have my xfers_per_project set to 10 for each of my machines, because I know my fibre can handle it and so can the CPDN upload server (when it's working). Their upload server is on a big UK cloud server that was handling 10s of 1000s of uploads/hr for the OpenIFS tasks. The Weather@Home ones are more of an issue because they go to a server in New Zealand, which I understand doesn't have quite the same capacity.
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Upload server status: 10/Jan 16:00GMT

Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today, but they will run checks first before opening up the upload server. I'll get an update tomorrow.
Joined: 4 Oct 15 Posts: 34 Credit: 9,075,151 RAC: 374
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left. The VM with oIFS still has about 14 GB left.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.

Luckily, I do not run VMs, so I am not getting nervous right now. My machine has a 512 GB SSD for the root, home, boot, and swap partitions, and I have two 4 TB spinning hard drives. Since I did not want to run out of space, envisioning large OpenIFS job requirements, I made a 512 GB partition on one of the hard drives and mounted it on /var/lib/boinc, where my distro's version of the BOINC client puts its files. CPDN is currently using 55 GB of this, there are 364 GB free for BOINC, and there are another 64 GB I could give to BOINC if it is ever needed.
Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.

So just shut the tasks down until you've got space free. It's not your problem to solve that the upload server is down. I've got a few machines halted on "Too many uploads in progress" (yes, I know how to fix it, I just don't see a point right now), and a few others are running out of disk because I put cheap 128 GB M.2 SSDs in my compute rigs. "Designing for upload servers being down for weeks with huge tasks" was not a design criterion I considered, and it will remain one I won't consider given the relative rarity of the problem. If the machines are full due to things out of my control, they're full. And if contracts aren't met because a lot of machines are unable to compute because they can't return results, similarly, not my problem.

It'll still take me a week+ to upload my pending results, unless I can get some good overnight bandwidth out of Starlink (of course, that doesn't solve that a bunch of the machines are solar powered and don't run overnight). I've got hundreds of gigabytes to upload, and I simply can't do that quickly. Rough numbers below.
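To put purely illustrative numbers on it (neither figure is exact): 300 GB pending at a sustained 10 Mbit/s of upstream, i.e. about 1.25 MB/s, works out to

300,000 MB / 1.25 MB/s = 240,000 s, or roughly 2.8 days

of continuous uploading. Add retries, a recovering server, and solar-limited operating hours, and a week+ is easily plausible.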
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
Update 22:30, 10/Jan

Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again.

Edit: I'm seeing 'no route to host' errors. Maybe something at the upload server needs re-enabling. Anyway, I'm told the data has been successfully migrated and the upload server has been enabled. Anything amiss can be dealt with quickly come office hours tomorrow, I would think.
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420
Yes, I am still seeing "connect(): failed" messages on all upload tries. But I still have 4 work units running and I am nowhere near filling up any disks, so no problem here.

Conan
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370
Might be a hung process holding on to port 80. They'll have to stop the service, kill any orphaned processes, and restart the service. In the meantime, you can check from this side whether the listener is back, as sketched below.
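A quick probe of the upload handler tells you more than the client's terse "connect(): failed". The hostname here is only a placeholder (take the real upload URL from one of the stuck transfers in your event log); the path is the usual BOINC one, but check yours:

curl -sS -I --max-time 10 http://upload.example.org/cgi-bin/file_upload_handler | head -n 1

A timeout or "connection refused" means the server end is still down; any HTTP status line at all means the listener is accepting connections again.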
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944
My uploads have now started. At 100 KB/second it will be a while!
Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681
My uploads have now started. At 100 KB/second it will be a while!

I've just had confirmation from CPDN that the upload server is now fully functional.
Joined: 1 Jan 07 Posts: 1061 Credit: 36,718,239 RAC: 8,054
Mine have started too. Reasonable speeds (for my line) of around 1,000 KB/sec once they've latched on, but the occasional file pauses and needs to retry. Expected, for this stage of a recovery. And I've been able to report several tasks, with just a few loose ends to tidy up.