climateprediction.net (CPDN) home page
Thread 'Upload server is out of disk space'

biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 47,674,094
RAC: 24,265
Message 67589 - Posted: 12 Jan 2023, 0:30:52 UTC

File uploads were going along quite nicely until this appeared in the boinc log.

Wed 11 Jan 2023 07:27:19 PM EST | climateprediction.net | [error] Error reported by file upload server: Server is out of disk space
ID: 67589
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67590 - Posted: 12 Jan 2023, 0:36:28 UTC

Seeing the same thing. Oh well. I'll shut the machines back down.
ID: 67590
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,744,071
RAC: 63,130
Message 67594 - Posted: 12 Jan 2023, 1:51:06 UTC

It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯
ID: 67594
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,705,793
RAC: 9,655
Message 67600 - Posted: 12 Jan 2023, 8:46:43 UTC

Woke up to this.

I'm also seeing that many uploads have reached 100%, but failed to complete. That suggests that the upload server may have failed to forward the files to backing cloud storage (or may have not done so quickly enough).
ID: 67600
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67601 - Posted: 12 Jan 2023, 10:11:33 UTC - in response to Message 67600.  

Waiting for an update from CPDN. My guess is the transfer server has stopped moving files off the upload server. We'll see. Hopefully most people uploaded enough they can start downloading tasks again.

ID: 67601
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,705,793
RAC: 9,655
Message 67603 - Posted: 12 Jan 2023, 10:25:22 UTC - in response to Message 67601.  

Thanks - please continue to keep us updated as and when.

I've suspended networking on the machine which has more disk space available - it can carry on crunching at least until tomorrow without pestering the upload server (and save me money, because I'm not using the GPUs while concentrating on IFS).

The machine with restricted disk space is doing GPU work (quick in and out, no long-term build up on disk), so will only contact the servers sporadically as the backoffs expire.
ID: 67603
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67609 - Posted: 12 Jan 2023, 12:44:56 UTC - in response to Message 67603.  

I'll post any updates I get to the 'Uploads are stuck' thread; I'm busy with other things. I'm sure Dave will update when he hears anything too.
ID: 67609
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67615 - Posted: 12 Jan 2023, 17:36:09 UTC - in response to Message 67594.  

It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯


It makes sense. Each upload takes up an HTTP slot on the server for some long while (minutes, in my case). When the server is out of connection slots, things just time out - it can't get your connection serviced.

When it's returning errors, that's a quick (milliseconds) sort of response. So it can service far, far more clients when it simply has to say, "I'm full, go away," than when it's processing a lot of long running uploads.
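
To put rough numbers on this, here is a minimal sketch. The slot count, upload time, and error-reply time are all hypothetical figures chosen for illustration, not measurements from the CPDN upload server.

```python
def requests_per_minute(slots: int, seconds_per_request: float) -> float:
    """Requests a pool of `slots` connections can finish per minute."""
    return slots * 60.0 / seconds_per_request


SLOTS = 100  # hypothetical number of connection slots

# Hypothetical: each upload holds a slot for 3 minutes.
uploads = requests_per_minute(SLOTS, 180.0)
# Hypothetical: an "out of disk space" reply takes 5 milliseconds.
errors = requests_per_minute(SLOTS, 0.005)

print(f"long uploads:  about {uploads:.0f} per minute")
print(f"error replies: about {errors:.0f} per minute")
```

With these made-up numbers the same pool answers tens of thousands of times more clients when it only has to refuse them, which is why everyone sees the error message while few uploads get through.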
ID: 67615
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67617 - Posted: 12 Jan 2023, 18:32:46 UTC - in response to Message 67594.  

wujj123456 wrote:
It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯
The web server, scheduler, feeder, validator, transitioner, download file handler… are on www.cpdn.org (status), but the upload file handler for the current OIFS work is on upload11.cpdn.org. They are physically different.
ID: 67617
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67631 - Posted: 13 Jan 2023, 2:44:37 UTC - in response to Message 67617.  

The web server, scheduler, feeder, validator, transitioner, download file handler… are on www.cpdn.org (status), but the upload file handler for the current OIFS work is on upload11.cpdn.org. They are physically different.


I could not upload a
UK Met Office HadSM4 at N144 resolution v8.02-i686-pc-linux-gnu
task result until the upload11.cpdn.org server started working again (before it quit again).
ID: 67631
mikey

Joined: 18 Nov 18
Posts: 21
Credit: 6,595,163
RAC: 2,029
Message 67680 - Posted: 14 Jan 2023, 2:29:22 UTC - in response to Message 67631.  

So YOU broke it this time, LOL!!! I too am stuck trying to upload completed tasks and have actually suspended the project on several PCs to stop the crunching and the constant back and forth, and to let things settle down so everyone can get their stuff through.
ID: 67680
[AF] Kalianthys

Joined: 20 Dec 20
Posts: 13
Credit: 40,052,490
RAC: 9,149
Message 67681 - Posted: 14 Jan 2023, 7:45:13 UTC - in response to Message 67680.  

Hello,

I could not upload Windows Weather At Home 2 tasks.
I have more than ten tasks with an upload error:
14/01/2023 08:42:09 | climateprediction.net | Temporarily failed upload of wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_16.zip: transient HTTP error
14/01/2023 08:42:09 | climateprediction.net | Temporarily failed upload of wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip: transient HTTP error


Can you help me?

Kali.
ID: 67681
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 67682 - Posted: 14 Jan 2023, 8:14:57 UTC - in response to Message 67681.  

If these are going to upload11, this should resolve when the backlog of OIFS tasks has cleared. If you enable http debug under Options > Event Log Options, you should be able to see whether that is the case. The XML file for that batch isn't on the Trello board the project uses, so I can't check from here. The other way to find out is to look in client_state.xml, where each task should have a line saying what the upload handler is.
ID: 67682
leloft

Joined: 7 Jun 17
Posts: 23
Credit: 44,434,789
RAC: 2,600,991
Message 67688 - Posted: 14 Jan 2023, 9:21:11 UTC - in response to Message 67609.  

I'll post any updates I get to the 'Uploads are stuck' thread; I'm busy with other things. I'm sure Dave will update when he hears anything too.


Here is an observation: I have five hosts with WU in uploading status. Of these five, three of them are successfully uploading files and as they are disgorging their backlog, they are able to download new WU, process and upload them. The two other hosts that are failing to secure an upload slot are blocked from downloading as they are up to capacity and therefore idle. Can anyone confirm that actively crunching machines are more successful at elbowing their way in to an upload slot? If so, it seems that it would be a shame that these machines are uploading 20 hours into a 28 day deadline, while backlog-enforced idling hosts are unable to fight their way onto the server. Just an observation, but it feels that it is more than just a sampling error.

fraser
ID: 67688
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,705,793
RAC: 9,655
Message 67689 - Posted: 14 Jan 2023, 9:40:10 UTC - in response to Message 67688.  

Yes, that probably is true. BOINC has an extensive system of 'backoffs': if something isn't working, it'll pause and wait - for longer and longer. But it will try a newly created upload, just once, as soon as it's been created. If that single upload gets through, then the backoffs are cleared, and everything starts moving again.

You can try and clear things, by using the 'retry' tools in BOINC Manager, but it gets very tedious, very quickly. Might be worth having a look, and giving things a prod, when you happen to be passing the machine. Otherwise, simply wait until the rush has died down - BOINC will retry periodically, just not very often.
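
The 'longer and longer' waiting described above is the general pattern known as exponential backoff. The sketch below illustrates that pattern only; it is not BOINC's actual algorithm, and the base delay, cap, and random jitter are hypothetical values picked for the example.

```python
import random


def backoff_delays(retries, base=60.0, cap=4 * 3600.0, seed=None):
    """Return one wait time (in seconds) per failed attempt.

    Generic exponential backoff with jitter. `base` and `cap` are
    hypothetical values, not taken from the BOINC client.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(retries):
        # The ceiling doubles after every failure until it hits the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Jitter spreads clients out so they don't all retry at once.
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


for attempt, delay in enumerate(backoff_delays(8, seed=1), start=1):
    print(f"failure {attempt}: wait about {delay / 60:.1f} minutes")
```

In this pattern a single success resets the attempt counter to zero, which matches the observation that one upload getting through sets everything moving again.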
ID: 67689
MiB1734

Joined: 16 Jan 18
Posts: 2
Credit: 121,919,969
RAC: 2,111
Message 67692 - Posted: 14 Jan 2023, 10:15:29 UTC

I have about 2.5 TB of result files and can upload about 10 GB a day. That means clearing the backlog will take 250 days.
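
The arithmetic behind that estimate, as a small sketch. It assumes the 10 GB figure is a per-day rate (which is what the 250-day result implies) and uses decimal units, 1 TB = 1000 GB.

```python
def days_to_clear(backlog_gb: float, gb_per_day: float) -> float:
    """Days needed to upload a backlog at a steady daily rate."""
    return backlog_gb / gb_per_day


backlog_gb = 2.5 * 1000  # 2.5 TB, assuming 1 TB = 1000 GB
print(days_to_clear(backlog_gb, 10))  # 250.0
```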
ID: 67692
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,016,442
RAC: 21,024
Message 67700 - Posted: 14 Jan 2023, 12:31:37 UTC

I am now down to 16 tasks uploading. I think I will be clear by the end of play tomorrow. Keeping to just one task running till backlog is cleared.
ID: 67700
[AF] Kalianthys

Joined: 20 Dec 20
Posts: 13
Credit: 40,052,490
RAC: 9,149
Message 67702 - Posted: 14 Jan 2023, 13:17:00 UTC - in response to Message 67682.  

Thank you, Dave.

This is what the XML file contains:

<file>
    <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
    <nbytes>90031062.000000</nbytes>
    <max_nbytes>150000000.000000</max_nbytes>
    <md5_cksum>e20a8b248529e2d3f15e277a2a530f41</md5_cksum>
    <status>1</status>
    <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>56</num_retries>
        <first_request_time>1671650199.948561</first_request_time>
        <next_request_time>1673693268.434832</next_request_time>
        <time_so_far>46278.530403</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>


Kali.
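
For anyone who wants to pull the same fields out without reading the file by hand, here is a minimal sketch that parses a `<file>` entry like the one above. The function name `summarize_upload` is made up for this example, and treating the timestamps as Unix epoch seconds is an assumption, although the values are consistent with it.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed copy of the <file> entry posted above.
SAMPLE = """
<file>
    <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
    <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>56</num_retries>
        <next_request_time>1673693268.434832</next_request_time>
    </persistent_file_xfer>
</file>
"""


def summarize_upload(file_xml: str) -> dict:
    """Extract the upload handler and retry state from one <file> entry."""
    elem = ET.fromstring(file_xml)
    xfer = elem.find("persistent_file_xfer")
    # Assumption: times are seconds since the Unix epoch.
    next_ts = float(xfer.findtext("next_request_time"))
    return {
        "name": elem.findtext("name"),
        "upload_url": elem.findtext("upload_url"),
        "num_retries": int(xfer.findtext("num_retries")),
        "next_request": datetime.fromtimestamp(next_ts, tz=timezone.utc),
    }


info = summarize_upload(SAMPLE)
print(info["upload_url"])
print(info["num_retries"], "retries; next attempt at", info["next_request"])
```

For this entry the handler is upload4.cpdn.org rather than upload11.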
ID: 67702
MiB1734

Joined: 16 Jan 18
Posts: 2
Credit: 121,919,969
RAC: 2,111
Message 67706 - Posted: 14 Jan 2023, 14:24:43 UTC - in response to Message 67700.  

I have 1400 tasks to upload. That means 2.5 TB. Unless there is a miracle, the backlog will last forever.
ID: 67706
leloft

Joined: 7 Jun 17
Posts: 23
Credit: 44,434,789
RAC: 2,600,991
Message 67707 - Posted: 14 Jan 2023, 14:29:24 UTC - in response to Message 67689.  

You can try and clear things, by using the 'retry' tools in BOINC Manager


What would that be in boinccmd? --network_available seems to do nothing, I assumed it was a toggle;
--file_transfer requires a filename and doesn't work with wildcards. I was hoping to set up a cronjob to try and improve the chances of getting a slot.

It seems to be a case of giving to those who already have. Is there some way the backoff period could be reduced to a few minutes for machines that have failed to upload, and set to a few tens of minutes for those that succeeded? If success is simply correlated with the number of attempts, then giving unsuccessful hosts shorter times between tries would stand a better chance of clearing some of these 'too many uploads' errors, at least enough to allow the stalled hosts to resume active duty. Just a thought.

fraser
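
One possible way around the lack of wildcards is to list the pending transfers first and then retry each one by name. This is a rough sketch, not a tested tool: it assumes `boinccmd --get_file_transfers` prints one `name: <filename>` line per transfer, and the project URL has to be supplied by the caller. The function names are made up for this example.

```python
import subprocess
from typing import List


def parse_transfer_names(listing: str) -> List[str]:
    """Collect file names from `boinccmd --get_file_transfers` output.

    Assumption: each transfer is reported with a line of the form
    'name: <filename>'. Adjust if your client prints something else.
    """
    names = []
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            names.append(line.split(":", 1)[1].strip())
    return names


def retry_all(project_url: str, boinccmd: str = "boinccmd") -> List[str]:
    """Ask the client to retry every pending transfer for one project."""
    listing = subprocess.run(
        [boinccmd, "--get_file_transfers"],
        capture_output=True, text=True, check=True,
    ).stdout
    names = parse_transfer_names(listing)
    for name in names:
        # --file_transfer takes the project URL, a file name and an operation.
        subprocess.run(
            [boinccmd, "--file_transfer", project_url, name, "retry"],
            check=False,
        )
    return names


# Usage (from a cron job, for example):
#   retry_all("<project URL as shown by boinccmd --get_project_status>")
```

If this is run from cron, a modest interval is kinder to a server that is already short of connection slots.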
ID: 67707
©2024 cpdn.org