climateprediction.net (CPDN) home page
Thread 'Upload server is out of disk space'

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 69489 - Posted: 17 Aug 2023, 16:22:57 UTC - in response to Message 69486.  

"Maybe there's a quick check Andy can do on the Korean server? There was an earlier message from computermzle(?) about looking for a blank string in the config. If someone can let me know which file exactly to look in I can discuss with Andy. Might be quick fix, if not, will rule it out."

In theory, certainly yes. But I don't have a magic carpet that will take him exactly to the point in question.

From the log I extracted yesterday, we know that that server is running "Apache/2.4.37 (centos)". That in turn leads to https://httpd.apache.org/docs/2.4/configuring.html, but then I'm stuck. It's the sort of thing that an experienced professional WebMaster could probably do in her sleep, but Andy is spread so thinly that he probably doesn't qualify - he's a generalist (like me), not a specialist. And I gather he's fighting another, more urgent, fire today. I'll send him an email for the morning.

I've just posted in Q&A that we're up against the clock on this one too. We're probably within 60 days of the scientific data held in the stuck upload files being permanently deleted. We shouldn't forget why we're really here.
ID: 69489
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69492 - Posted: 17 Aug 2023, 21:26:45 UTC - in response to Message 69489.  

Ok, once Andy has sorted out the ssl certificate issue on the dev site, I'll message him about this and pass on the info about a potential missing blank. He might know what it is if he's come across it before.
ID: 69492
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 69494 - Posted: 18 Aug 2023, 9:40:05 UTC

I see the other fire has been put out, but I've had an urgent call-out for this morning - too late for me to write to Andy now. It's one of those which might be 5 seconds or 5 hours, 5 minutes or 5 days. Back when I'm back.

Please tell Andy to investigate first. The missing blank line is just a theory/possibility at the moment, not confirmed (I did a test on another project yesterday, and the BOINC client didn't log the blank line - but did restart the upload). It would probably be helpful to extract the server log for one of the affected hosts we've identified here and in Q&A. If he hasn't got time to be selective, just grab the whole bally lot and hand it over to one of us to filter.
ID: 69494
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 69495 - Posted: 18 Aug 2023, 11:45:05 UTC

OK, I'm back. Plan A worked - five minutes of professional insight, an hour and a half to reassure the user and escape the house. I need a lie-down.
ID: 69495
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69507 - Posted: 22 Aug 2023, 11:14:51 UTC - in response to Message 69495.  

Following a meeting yesterday, I've been asked by the CPDN Project Director to pass on a message.

If anyone is having problems with 'stuck uploads', then Abort the task. This batch is of questionable scientific quality because of the very high number of failures.

As this batch is now closed, no resends for any Aborted tasks will be sent out to others.

Problems on the server have been investigated. One of their disks filled completely resulting in a move of data, which may be causing the problem. The server will be looked at again before any more batches go out (hopefully in the next couple of weeks when folk return from holiday).
---
CPDN Visiting Scientist
ID: 69507
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 69509 - Posted: 22 Aug 2023, 11:33:13 UTC - in response to Message 69507.  

Please DON'T do that. Instead, cancel the TRANSFER only - in the transfers tab - and the task should become 'ready to report'. Update the project as normal, and you'll get a lot of disk space freed up, without a blot on your account record.
ID: 69509
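
For anyone who would rather use the command line than the Transfers tab, the "cancel the transfer, then update" steps above can be sketched with `boinccmd`. This is only a sketch under assumptions, not project guidance: it assumes `boinccmd` is on the PATH and can talk to the local client, and the project URL and file name below are placeholders that should be replaced with the values shown by `boinccmd --get_file_transfers`. The helper function names are made up for illustration.

```python
import subprocess

# Placeholder values (assumptions): copy the real project URL and file name
# from the output of `boinccmd --get_file_transfers`.
PROJECT_URL = "https://www.cpdn.org/"
STUCK_FILE = "example_upload_file.zip"


def abort_transfer_cmd(project_url, filename):
    """Command that aborts one file transfer, leaving the task itself alone."""
    return ["boinccmd", "--file_transfer", project_url, filename, "abort"]


def update_project_cmd(project_url):
    """Command that asks the client to contact the project (an 'Update')."""
    return ["boinccmd", "--project", project_url, "update"]


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: the three commands are printed, not executed.
    run(["boinccmd", "--get_file_transfers"])
    run(abort_transfer_cmd(PROJECT_URL, STUCK_FILE))
    run(update_project_cmd(PROJECT_URL))
```

The dry-run default is deliberate: it lets you check the URL and file name before anything is cancelled. Pass `dry_run=False` to `run` only once the printed commands look right.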
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 69540 - Posted: 30 Aug 2023, 16:41:24 UTC - in response to Message 69509.  
Last modified: 30 Aug 2023, 16:41:54 UTC

Somehow 2/4 WUs managed to get through to the server and they've been labelled success. I continue to struggle with the 4 zips left of the other 2 WUs. I may cancel the upload (transfer) only as Richard suggested.
ID: 69540
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 69593 - Posted: 6 Sep 2023, 13:33:55 UTC - in response to Message 69540.  

Any news on the upload server in Korea? I still keep the last WUs in the queue trying to upload, hoping they pass through and the results are not wasted. I can wait until they expire.
ID: 69593
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,022,240
RAC: 20,762
Message 69594 - Posted: 6 Sep 2023, 14:12:00 UTC - in response to Message 69593.  

"Any news on the upload server in Korea? I still keep the last WUs in the queue trying to upload, hoping they pass through and the results are not wasted. I can wait until they expire."

Given that this batch is due to be rejigged and then go out again I would be tempted to just wait until they time out as you say. I can't remember if there is a maximum number of tries after which BOINC will abort them? Richard is more likely to have a quick answer to that than I am. If I were desperate for space I would abort the transfers but given the amount of space most modern rigs have, if you are in that desperate need of space you probably shouldn't be running BOINC anyway.
ID: 69594
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,708,278
RAC: 9,361
Message 69595 - Posted: 6 Sep 2023, 14:42:37 UTC - in response to Message 69594.  

It's a time limit, rather than the number of retries. 90 days from first attempt.
ID: 69595
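
To see what "90 days from first attempt" means for a particular file, the arithmetic can be sketched as below. It takes the 90-day figure from the post above at face value, and the function names and the example date in the usage note are hypothetical; the client's actual behaviour has not been checked here.

```python
from datetime import datetime, timedelta, timezone

# 90 days is the figure quoted in the thread, not a value read from BOINC.
UPLOAD_LIMIT = timedelta(days=90)


def upload_deadline(first_attempt):
    """Time at which the upload would stop being retried."""
    return first_attempt + UPLOAD_LIMIT


def days_remaining(first_attempt, now):
    """Whole days left before the limit; negative once it has passed."""
    return (upload_deadline(first_attempt) - now).days
```

For example, with a made-up first attempt of `datetime(2023, 8, 1, tzinfo=timezone.utc)`, `upload_deadline` gives 30 Oct 2023 UTC.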
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 69604 - Posted: 7 Sep 2023, 17:57:40 UTC - in response to Message 69595.  

Thanks Richard. I may not have the patience to wait 90 days, though keeping up with CPDN is :) Will see if the zips manage to get through, otherwise will cancel uploads.
ID: 69604
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 69613 - Posted: 8 Sep 2023, 12:03:36 UTC - in response to Message 69604.  

"Thanks Richard. I may not have the patience to wait 90 days, though keeping up with CPDN is :) Will see if the zips manage to get through, otherwise will cancel uploads."

I recommend cancelling the uploads. The Korean scientist has already analysed the data returned so far and we're now preparing for the replacement batch. I saw the results in a meeting this week. There's no value keeping them on your machine, but thanks for reporting.
---
CPDN Visiting Scientist
ID: 69613
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70029 - Posted: 4 Nov 2023, 13:51:32 UTC - in response to Message 69509.  
Last modified: 4 Nov 2023, 14:08:25 UTC

"Please DON'T do that. Instead, cancel the TRANSFER only - in the transfers tab - and the task should become 'ready to report'. Update the project as normal, and you'll get a lot of disk space freed up, without a blot on your account record."

Oops, clicked wrong Quote button. No delete post button.
ID: 70029
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 70030 - Posted: 4 Nov 2023, 13:54:11 UTC - in response to Message 69507.  
Last modified: 4 Nov 2023, 14:08:08 UTC

"Following a meeting yesterday, I've been asked by the CPDN Project Director to pass on a message.
"If anyone is having problems with 'stuck uploads', then Abort the task. This batch is of questionable scientific quality because of the very high number of failures.
"As this batch is now closed, no resends for any Aborted tasks will be sent out to others.
"Problems on the server have been investigated. One of their disks filled completely resulting in a move of data, which may be causing the problem. The server will be looked at again before any more batches go out (hopefully in the next couple of weeks when folk return from holiday)."

What batch?
Should I abort the 16 Linux HadSM4 WUs I received a week ago? The trickles seem to go through but the ULs do not.

Edit: I found my answer, "The following hadsm4 batches for the DOCILE project have now been closed: 937, 938, 939, 940, 941. These batches were issued in Nov/22."
ID: 70030

©2024 cpdn.org