Upload server is out of disk space

Author	Message
biodoc · Joined: 2 Oct 19 · Posts: 21 · Credit: 47,674,094 · RAC: 24,265	Message 67589 - Posted: 12 Jan 2023, 0:30:52 UTC
File uploads were going along quite nicely until this appeared in the BOINC log:
Wed 11 Jan 2023 07:27:19 PM EST | climateprediction.net | [error] Error reported by file upload server: Server is out of disk space

SolarSyonyk · Joined: 7 Sep 16 · Posts: 262 · Credit: 34,915,412 · RAC: 16,463	Message 67590 - Posted: 12 Jan 2023, 0:36:28 UTC
Seeing the same thing. Oh well, I'll shut the machines back down.

wujj123456 · Joined: 14 Sep 08 · Posts: 127 · Credit: 41,561,645 · RAC: 58,935	Message 67594 - Posted: 12 Jan 2023, 1:51:06 UTC
It's kind of funny: I was not able to upload anything because of a transient HTTP error, but I can see these messages like everyone else. ¯\_(ツ)_/¯

Richard Haselgrove · Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,691,690 · RAC: 10,582	Message 67600 - Posted: 12 Jan 2023, 8:46:43 UTC
Woke up to this. I'm also seeing that many uploads have reached 100% but failed to complete. That suggests the upload server may have failed to forward the files to its backing cloud storage (or may not have done so quickly enough).

Glenn Carver · Joined: 29 Oct 17 · Posts: 1048 · Credit: 16,404,330 · RAC: 16,403	Message 67601 - Posted: 12 Jan 2023, 10:11:33 UTC - in response to Message 67600.
Richard Haselgrove wrote: "I'm also seeing that many uploads have reached 100%, but failed to complete. That suggests that the upload server may have failed to forward the files to backing cloud storage [...]"
Waiting for an update from CPDN. My guess is that the transfer server has stopped moving files off the upload server. We'll see. Hopefully most people uploaded enough that they can start downloading tasks again.

Richard Haselgrove · Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,691,690 · RAC: 10,582	Message 67603 - Posted: 12 Jan 2023, 10:25:22 UTC - in response to Message 67601.
Thanks - please continue to keep us updated as and when. I've suspended networking on the machine which has more disk space available - it can carry on crunching at least until tomorrow without pestering the upload server (and save me money, because I'm not using the GPUs while concentrating on IFS). The machine with restricted disk space is doing GPU work (quick in and out, no long-term build-up on disk), so it will only contact the servers sporadically as the backoffs expire.

Glenn Carver · Joined: 29 Oct 17 · Posts: 1048 · Credit: 16,404,330 · RAC: 16,403	Message 67609 - Posted: 12 Jan 2023, 12:44:56 UTC - in response to Message 67603.
Richard Haselgrove wrote: "Thanks - please continue to keep us updated as and when. [...]"
If I get any updates I'll post them to the 'Uploads are stuck' thread; I'm busy with other things. I'm sure Dave will post an update when he hears anything too.

SolarSyonyk · Joined: 7 Sep 16 · Posts: 262 · Credit: 34,915,412 · RAC: 16,463	Message 67615 - Posted: 12 Jan 2023, 17:36:09 UTC - in response to Message 67594.
wujj123456 wrote: "It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯"
It makes sense. Each upload occupies an HTTP slot on the server for a long while (minutes, in my case). When the server is out of connection slots, requests just time out because it can't service your connection. Returning an error, by contrast, is a quick (milliseconds) response. So the server can handle far, far more clients when it simply has to say "I'm full, go away" than when it is processing a lot of long-running uploads.
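SolarSyonyk's argument is essentially Little's law (a framing added here, not taken from the post): when every connection slot is busy, the most requests a server can complete per second is the number of slots divided by how long each request holds one. A minimal Python sketch with hypothetical numbers (100 slots, a 5-minute upload, a 5 ms error reply; none of these figures come from CPDN):

```python
def max_request_rate(slots: int, seconds_per_request: float) -> float:
    """Little's law with every slot busy: throughput = concurrency / time held."""
    return slots / seconds_per_request


# Hypothetical figures, for illustration only.
SLOTS = 100
uploads_per_s = max_request_rate(SLOTS, 5 * 60)  # each upload holds a slot ~5 minutes
errors_per_s = max_request_rate(SLOTS, 0.005)    # an "out of disk space" reply takes ~5 ms

print(f"uploads: {uploads_per_s:.2f}/s, error replies: {errors_per_s:,.0f}/s")
```

With these assumed figures the same slots can send error replies about 60,000 times faster than they can complete uploads, which fits the observation that the error message gets through while uploads time out.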

xii5ku · Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077	Message 67617 - Posted: 12 Jan 2023, 18:32:46 UTC - in response to Message 67594.
wujj123456 wrote: "It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯"
The web server, scheduler, feeder, validator, transitioner, download file handler… are on www.cpdn.org (status), but the upload file handler for the current OIFS work is on upload11.cpdn.org. They are physically different.

Jean-David Beyer · Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,915 · RAC: 2,154	Message 67631 - Posted: 13 Jan 2023, 2:44:37 UTC - in response to Message 67617.
xii5ku wrote: "[...] the upload file handler for the current OIFS work is on upload11.cpdn.org. They are physically different."
I could not upload a UK Met Office HadSM4 at N144 resolution v8.02-i686-pc-linux-gnu task result until the upload11.cpdn.org server started working again (before it quit again).

mikey · Joined: 18 Nov 18 · Posts: 21 · Credit: 6,588,536 · RAC: 1,801	Message 67680 - Posted: 14 Jan 2023, 2:29:22 UTC - in response to Message 67631.
Jean-David Beyer wrote: "I could not upload a UK Met Office HadSM4 at N144 resolution v8.02-i686-pc-linux-gnu task result until the upload11.cpdn.org.server started working again (before it quit again)."
So YOU broke it this time, LOL!!! I too am stuck trying to upload completed tasks, and have suspended the project on several PCs to stop the crunching and the constant back-and-forth, and let things settle down so everyone can get their stuff through.

[AF] Kalianthys · Joined: 20 Dec 20 · Posts: 13 · Credit: 40,040,893 · RAC: 9,902	Message 67681 - Posted: 14 Jan 2023, 7:45:13 UTC - in response to Message 67680.
Hello, I could not upload Windows Weather At Home 2 tasks. I have more than ten tasks with an error on upload:
14/01/2023 08:42:09 | climateprediction.net | Temporarily failed upload of wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_16.zip: transient HTTP error
14/01/2023 08:42:09 | climateprediction.net | Temporarily failed upload of wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip: transient HTTP error
Can you help me? Kali.

Dave Jackson · Volunteer moderator · Joined: 15 May 09 · Posts: 4535 · Credit: 18,975,025 · RAC: 21,875	Message 67682 - Posted: 14 Jan 2023, 8:14:57 UTC - in response to Message 67681.
[AF] Kalianthys wrote: "I could not upload windows task Weather At Home 2. [...] transient HTTP error Can you help me ?"
If they are going to upload11, this should resolve once the backlog of OIFS tasks has cleared. If you enable http_debug under Options > Event Log options, you should be able to see whether that is the case. The XML file for that batch isn't on the Trello board the project uses, so I can't check it from here. The other way to find out is to look at client_state.xml, where each output file should have a line giving its upload handler.
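Following Dave's suggestion, a small Python sketch that reads client_state.xml and groups pending uploads by upload handler. It assumes, as in the <file> excerpt Kali posts further down the thread, that each output file carries an <upload_url> element and that files still waiting to transfer have a <persistent_file_xfer> child; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def pending_uploads_by_handler(xml_text: str) -> dict[str, list[str]]:
    """Map each upload handler URL to the names of files still waiting to transfer."""
    root = ET.fromstring(xml_text)
    by_handler: dict[str, list[str]] = defaultdict(list)
    for f in root.iter("file"):
        url = (f.findtext("upload_url") or "").strip()
        # Assumption: a pending transfer is recorded as a <persistent_file_xfer> child.
        if url and f.find("persistent_file_xfer") is not None:
            by_handler[url].append((f.findtext("name") or "?").strip())
    return dict(by_handler)


def report(path: str) -> None:
    """Print a per-handler summary for the client_state.xml at `path`."""
    with open(path, encoding="utf-8") as fh:
        handlers = pending_uploads_by_handler(fh.read())
    for url, names in sorted(handlers.items()):
        print(f"{url}: {len(names)} pending file(s)")
        for name in names:
            print(f"    {name}")
```

Call report() with the path to client_state.xml in the BOINC data directory (typically /var/lib/boinc-client on Debian/Ubuntu packages and C:\ProgramData\BOINC on Windows; check your own install). The script only reads the file, which the client rewrites periodically.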

leloft · Joined: 7 Jun 17 · Posts: 23 · Credit: 44,434,789 · RAC: 2,600,991	Message 67688 - Posted: 14 Jan 2023, 9:21:11 UTC - in response to Message 67609.
Glenn Carver wrote: "I'll post updates if I get them to the 'Uploads are stuck' thread, am busy with other things. [...]"
Here is an observation: I have five hosts with WUs in uploading status. Three of them are successfully uploading files, and as they disgorge their backlog they are able to download new WUs, process them and upload them. The other two hosts are failing to get an upload slot; they are blocked from downloading because they are at capacity, and are therefore idle. Can anyone confirm that actively crunching machines are more successful at elbowing their way into an upload slot? If so, it seems a shame that these machines are uploading 20 hours into a 28-day deadline, while hosts idled by their backlog are unable to fight their way onto the server. Just an observation, but it feels like more than a sampling error. fraser

Richard Haselgrove · Joined: 1 Jan 07 · Posts: 1061 · Credit: 36,691,690 · RAC: 10,582	Message 67689 - Posted: 14 Jan 2023, 9:40:10 UTC - in response to Message 67688.
Yes, that is probably true. BOINC has an extensive system of 'backoffs': if something isn't working, it pauses and waits, for longer and longer each time. But it will try a newly created upload once, as soon as it's been created. If that single upload gets through, the backoffs are cleared and everything starts moving again. You can try to clear things using the 'retry' tools in BOINC Manager, but it gets very tedious, very quickly. It might be worth having a look and giving things a prod when you happen to be passing the machine. Otherwise, simply wait until the rush has died down; BOINC will retry periodically, just not very often.
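The behaviour Richard describes can be sketched as exponential backoff that is bypassed for a newly created upload and cleared on success. This is an illustrative model, not BOINC's client code: the class name, the 60-second minimum, the 4-hour cap and the random jitter are all assumptions.

```python
import random


class UploadBackoff:
    """Hypothetical model: exponential backoff, skipped for new uploads, reset on success."""

    def __init__(self, min_delay: float = 60.0, max_delay: float = 4 * 3600.0, seed=None):
        self.min_delay = min_delay  # assumed, not BOINC's real constant
        self.max_delay = max_delay  # assumed cap on the wait
        self.failures = 0
        self.next_try = 0.0
        self._rng = random.Random(seed)

    def should_try(self, now: float, new_upload: bool = False) -> bool:
        # A freshly created upload gets one immediate attempt, ignoring the backoff.
        return new_upload or now >= self.next_try

    def record(self, now: float, ok: bool) -> None:
        if ok:
            # One success clears the backoff, so everything queued can move again.
            self.failures = 0
            self.next_try = now
            return
        self.failures += 1
        delay = min(self.max_delay, self.min_delay * 2 ** (self.failures - 1))
        # Random jitter so many hosts don't retry in lockstep.
        self.next_try = now + delay * self._rng.uniform(0.5, 1.0)
```

In this model an idle host only retries when its growing timer expires, while a crunching host gets an extra immediate attempt every time it finishes a result, which would produce the pattern leloft observed.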

MiB1734 · Joined: 16 Jan 18 · Posts: 2 · Credit: 121,919,969 · RAC: 2,111	Message 67692 - Posted: 14 Jan 2023, 10:15:29 UTC
I have about 2.5 TB of result files and can upload about 10 GB a day. That means clearing the backlog will take 250 days.
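The 250-day figure works out if the 10 GB is a daily upload rate (the post implies this but does not state it) and decimal units are used:

$$
\frac{2.5\ \text{TB}}{10\ \text{GB/day}} = \frac{2500\ \text{GB}}{10\ \text{GB/day}} = 250\ \text{days}
$$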

Dave Jackson · Volunteer moderator · Joined: 15 May 09 · Posts: 4535 · Credit: 18,975,025 · RAC: 21,875	Message 67700 - Posted: 14 Jan 2023, 12:31:37 UTC
I am now down to 16 tasks uploading. I think I will be clear by the end of play tomorrow. I'm keeping to just one task running until the backlog is cleared.

[AF] Kalianthys · Joined: 20 Dec 20 · Posts: 13 · Credit: 40,040,893 · RAC: 9,902	Message 67702 - Posted: 14 Jan 2023, 13:17:00 UTC - in response to Message 67682.
Dave Jackson wrote: "[...] The other way to find out is looking at client_state.xml where each task should have a line saying what the upload handler is."
Thank you, Dave. This is what the XML file contains:
<file>
    <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
    <nbytes>90031062.000000</nbytes>
    <max_nbytes>150000000.000000</max_nbytes>
    <md5_cksum>e20a8b248529e2d3f15e277a2a530f41</md5_cksum>
    <status>1</status>
    <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>56</num_retries>
        <first_request_time>1671650199.948561</first_request_time>
        <next_request_time>1673693268.434832</next_request_time>
        <time_so_far>46278.530403</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
Kali.
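The request times in Kali's excerpt are Unix epoch seconds. A short Python sketch (the function name is hypothetical, and time_so_far is assumed to be in seconds) decodes them: the upload was first requested on 21 December 2022 UTC, has been retried 56 times, and is addressed to upload4.cpdn.org rather than upload11.cpdn.org.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The <file> record from Kali's client_state.xml excerpt, copied verbatim.
SNIPPET = """
<file>
  <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
  <nbytes>90031062.000000</nbytes>
  <max_nbytes>150000000.000000</max_nbytes>
  <md5_cksum>e20a8b248529e2d3f15e277a2a530f41</md5_cksum>
  <status>1</status>
  <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>56</num_retries>
    <first_request_time>1671650199.948561</first_request_time>
    <next_request_time>1673693268.434832</next_request_time>
    <time_so_far>46278.530403</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
"""


def summarize_transfer(file_xml: str) -> dict:
    """Decode the transfer bookkeeping of one <file> record."""
    f = ET.fromstring(file_xml.strip())
    x = f.find("persistent_file_xfer")

    def utc(tag: str) -> datetime:
        # Epoch seconds (with a fractional part) converted to a UTC datetime.
        return datetime.fromtimestamp(float(x.findtext(tag)), tz=timezone.utc)

    return {
        "upload_url": f.findtext("upload_url").strip(),
        "size_mb": float(f.findtext("nbytes")) / 1e6,
        "retries": int(x.findtext("num_retries")),
        "first_request": utc("first_request_time"),
        "next_request": utc("next_request_time"),
        "hours_spent": float(x.findtext("time_so_far")) / 3600,  # assumed seconds
    }


for key, value in summarize_transfer(SNIPPET).items():
    print(f"{key}: {value}")
```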

MiB1734 · Joined: 16 Jan 18 · Posts: 2 · Credit: 121,919,969 · RAC: 2,111	Message 67706 - Posted: 14 Jan 2023, 14:24:43 UTC - in response to Message 67700.
I have 1400 tasks to upload, which is 2.5 TB. No wonder the backlog is taking forever.
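Putting MiB1734's two posts together gives the average result size and, if the 10 GB from the earlier post is a daily rate (an assumption), roughly how many results a day that bandwidth clears. A minimal Python sketch using decimal units:

```python
def average_size_gb(total_tb: float, n_tasks: int) -> float:
    """Average result size, using decimal units (1 TB = 1000 GB)."""
    return total_tb * 1000 / n_tasks


def tasks_per_day(daily_gb: float, avg_gb: float) -> float:
    """How many average-sized results a given daily upload volume clears."""
    return daily_gb / avg_gb


avg = average_size_gb(2.5, 1400)  # 2.5 TB over 1400 tasks, from the post
per_day = tasks_per_day(10, avg)  # 10 GB/day is an assumed reading of the earlier post
print(f"{avg:.2f} GB per task, about {per_day:.1f} tasks per day")
```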

leloft · Joined: 7 Jun 17 · Posts: 23 · Credit: 44,434,789 · RAC: 2,600,991	Message 67707 - Posted: 14 Jan 2023, 14:29:24 UTC - in response to Message 67689.
Richard Haselgrove wrote: "You can try and clear things, by using the 'retry' tools in BOINC Manager"
What would that be in boinccmd? --network_available seems to do nothing (I assumed it was a toggle), and --file_transfer requires a filename and doesn't work with wildcards. I was hoping to set up a cron job to improve the chances of getting a slot. It seems to be a case of giving to those who already have. Is there some way the backoff period could be reduced to a few minutes for machines that have failed to upload, and a few tens of minutes for those that succeeded? If it is simply a correlation between the number of attempts and successful uploads, then giving unsuccessful hosts shorter times between tries would stand a better chance of clearing some of these 'too many uploads' errors, at least enough to let the stalled hosts resume active duty. Just a thought. fraser
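boinccmd does document a per-file retry, `--file_transfer URL filename retry`, and describes `--network_available` as retrying deferred network communication rather than as a toggle. Since there is no wildcard form, a cron job would have to issue one command per pending upload. A Python sketch under these assumptions: file names are read from client_state.xml (structure as in Kali's excerpt above), the project URL must be the master URL BOINC uses for CPDN on your host (`boinccmd --get_project_status` lists it), and the function names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
import xml.etree.ElementTree as ET


def pending_upload_names(xml_text: str) -> list[str]:
    """Names of files whose pending transfer record is marked as an upload."""
    root = ET.fromstring(xml_text)
    names = []
    for f in root.iter("file"):
        xfer = f.find("persistent_file_xfer")
        if xfer is not None and (xfer.findtext("is_upload") or "").strip() == "1":
            names.append((f.findtext("name") or "").strip())
    return [n for n in names if n]


def retry_commands(names: list[str], project_url: str) -> list[list[str]]:
    # --file_transfer takes a single file name, so build one command per file.
    return [["boinccmd", "--file_transfer", project_url, n, "retry"] for n in names]


def retry_all(state_path: str, project_url: str, dry_run: bool = True) -> None:
    """Retry every pending upload listed in client_state.xml (hypothetical helper)."""
    with open(state_path, encoding="utf-8") as fh:
        cmds = retry_commands(pending_upload_names(fh.read()), project_url)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)  # keep going if one retry fails
```

boinccmd may need to be run from the BOINC data directory, or given --passwd, to authenticate to the client.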