Best Swap file size for CPDN?

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67206 - Posted: 2 Jan 2023, 10:11:44 UTC
Last modified: 2 Jan 2023, 10:23:30 UTC

The server is now running at 30% load (e.g. 6+ GB per task), so there should be no reason for any further errors from my end.


And just to be clear, because you are trying to imply we are just wasting power: we have done 667 successful tasks in the last few months (on this server), with the majority done in the last few weeks. And at 7000 W for 256t, the premise that we are the ones who are inefficient is frankly absurd.

I am not psychic. If I can run at 50-99% load on every project on BOINC without getting a 50% failure rate, then I of course (like most) would assume this is the case at CPDN.

I already stated that the 1 task quota was related to the errors; I never mentioned network connections.

Thank you for all the advice and help, especially about rsc_memory_bound from Glenn.

We have everything we need now. We won't be running VirtualBox, or upgrading RAM anytime soon.

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67207 - Posted: 2 Jan 2023, 10:23:32 UTC - in response to Message 67206.  

No, that wasn't my point at all. If you have done good work in the past, and if your other machines are producing good work now (I didn't check those), that's all good.

But you posted about one specific machine, and about one specific problem. I missed your post about the reason for that particular sequence of errors: if you know it, and have fixed it, then that's OK too.

But for you and the other readers of this thread: a quota of 'one per day' will not start to increase until you can complete and report successful tasks. I've learned over the years not to do anything experimental during a breakdown. Sometimes you can fall down a trapdoor and not be able to claw your way out until things are fixed.

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67208 - Posted: 2 Jan 2023, 10:43:32 UTC
Last modified: 2 Jan 2023, 11:00:38 UTC

667 successful tasks on *this server, most in the last few weeks.

The only thing that changed was that CPDN put out a large number of tasks, which allowed large hosts to get loaded up with tasks; in our case around 60%.

However I doubt we were the only ones that ran with less memory /task than rsc_memory_bound, so I guess you might see a pretty high failure rate across the board (on large core hosts).

Thank you for your info that you cannot get out of BOINC Jail, until you complete AND report tasks; something that no-one can do presently.

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67209 - Posted: 2 Jan 2023, 11:08:25 UTC - in response to Message 67208.  

If the recent errors were due to your attempts to increase available disk space by installing an additional drive, fair enough - I've thought about doing that too, if this outage continues to extend. I'm now warned that it's not as easy as it looks.

But in the same post, you did ask:

Can anything be done to remove this quota?
Yes, there are two ways.

1) you can persuade a project administrator to delve into the project's master database, locate the record for your computer, and change the figure back to something bigger, and let the healing process continue from there.
2) or you can wait it out, and make sure those single tasks per day are ready to upload and report when the project is ready for them. The more you can report on that day, the more your quota will increase - but it may take several days to reach 256.
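
To give a feel for how quickly option 2 could recover, here is a minimal sketch. It assumes the per-host quota doubles with each successfully reported task, up to the project maximum. That doubling rule is an assumption about the BOINC server and is not confirmed anywhere in this thread; the function name is hypothetical.

```python
def reports_needed(start_quota: int, target_quota: int, growth_factor: int = 2) -> int:
    """Count successful reports needed to grow a daily quota to a target.

    ASSUMPTION: each successful report multiplies the quota by
    `growth_factor` (2 here), capped at `target_quota`. The real server
    rule may differ.
    """
    if start_quota < 1 or target_quota < start_quota or growth_factor < 2:
        raise ValueError("need 1 <= start_quota <= target_quota and growth_factor >= 2")
    quota = start_quota
    reports = 0
    while quota < target_quota:
        quota = min(quota * growth_factor, target_quota)
        reports += 1
    return reports


if __name__ == "__main__":
    # From a quota of 1 back up to the 256 mentioned above.
    print(reports_needed(1, 256))
```

Under that assumption, how fast the quota heals depends on how many good results can be reported, which is why keeping finished tasks safe until the uploads work again matters.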

To be honest, come Tuesday morning, I suspect the project's attention will be fixed on getting the pending results gathered in as quickly as possible: my two 6-core machines each have over 70 waiting already, and I can go on adding to that number almost indefinitely. I'm not going to do anything to risk them.

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67210 - Posted: 2 Jan 2023, 11:41:23 UTC
Last modified: 2 Jan 2023, 11:43:09 UTC

If the recent errors were due to your attempts to increase available disk space by installing an additional drive


Richard, are you saying that it is incorrect to think that these tasks had such a high failure rate because RAM use was above rsc_memory_bound and hence swap was constantly used?

That is how I understood it from what Glenn said.

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67211 - Posted: 2 Jan 2023, 11:49:33 UTC - in response to Message 67210.  
Last modified: 2 Jan 2023, 12:12:57 UTC

I haven't followed that part of the conversation in detail.

There wasn't any detailed 'reason for failure' in the task I looked at on the web. The only clues are likely to be buried deep in the history of BOINC's event log (stdoutdae.txt on disk in Windows, in the system journal for Linux service installs). But it may depend on which event log flags were active at the time.

Edit - my own guess is that the extra time wasted by disk swapping and thrashing at the end of the run might have exceeded some obscure time limit. The tasks had certainly run for an extended period, but with the current state of the server, we can't see exactly how far they'd got. If you can track down a task which uploaded a full stderr.txt to the web, but is still reported as a failure, that might contain some clues.

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67212 - Posted: 2 Jan 2023, 12:09:23 UTC
Last modified: 2 Jan 2023, 12:38:16 UTC

I actually did notice that most of the tasks that failed did so just at the very end.

If you can track down a task which uploaded a full stderr.txt to the web, but is still reported as a failure, that might contain some clues.

Not sure I want to look through 332 errors.

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67213 - Posted: 2 Jan 2023, 12:24:15 UTC
Last modified: 2 Jan 2023, 12:52:08 UTC

Actually there were 2 issues.

One was the 100GB limit which was the cause of some of the earlier errors.

196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED

https://www.cpdn.org/result.php?resultid=22264124

And I believe the 2nd was when it started not having enough memory and hence started to use the swap:

1 (0x00000001) Unknown error code

stderr: https://gist.github.com/ncoded/c875e9a955252dd2a15540914de2e059

And before anyone asks: no, we didn't click cancel or abort or anything like that in the Client (no matter what it says in stderr).

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67214 - Posted: 2 Jan 2023, 12:42:06 UTC - in response to Message 67213.  

Yes, I got caught by that 100 GB limit too. When uploads started failing on Christmas Eve, I knew we were likely to be in for a long haul, and I was due to be away from home for four days. So I cached as much work as I could overnight, and let the machines chew on them at leisure while I was away. On one machine, that worked - it's still working through that stock (but BOINC is beginning to grumble about deadlines). But the other machine failed - I'd inadvertently left the 100 GB limit trigger in place - and that stock ran out yesterday.

But I've spent 15 years learning how to drive this thing, and I know what can be tweaked in an emergency. I'm fetching work again, and still crunching, while keeping all my results ready for the scientists. I'm not going to post the technique in public - it could be dangerous if used inappropriately - but it can be done.

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67215 - Posted: 2 Jan 2023, 13:07:20 UTC
Last modified: 2 Jan 2023, 13:29:00 UTC

Well let's hope the upload issue gets sorted shortly so all can get back to crunching.

Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 67219 - Posted: 2 Jan 2023, 14:25:07 UTC - in response to Message 67213.  
Last modified: 2 Jan 2023, 14:26:38 UTC

Anything above 10% task fails is probably above average and needs some investigating. Ideally, 5% is where you should be.

Actually there were 2 issues.
One was the 100GB limit which was the cause of some of the earlier errors.
196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED
https://www.cpdn.org/result.php?resultid=22264124
I've been looking at these fails. Does this limit refer to the total disk usage under '../projects/climateprediction.net' plus '../slots/0' or is it specifically for the task allocated space in '../slots/0/'? Apologies if this has been asked before.

There is an issue with the restart files, which can accumulate if the model is restarted frequently. I notice from your second link that the log shows the model restarting several times. If the model restarts, it will keep its files for that instance until the job ends, as a backup. However, this is a problem if the model restarts frequently, as these files can accumulate (they are ~1 GB each) and add to the total space. I am going to change this.

And I believe the 2nd was when it started not having enough memory and hence started to use the swap:
1 (0x00000001) Unknown error code
stderr: https://gist.github.com/ncoded/c875e9a955252dd2a15540914de2e059
No, the model failed to restart properly because of bad data read from a restart file, not due to lack of memory or swapping. (Swapping will slow the model down enormously. If no swap is configured, the model will crash instead, but so will your OS processes.)

What I think is happening is sometimes the restart data is not flushed to disk before the task is stopped for any reason. The code issues 'flush to disk', but this is only a request not a command to the operating system. So it is possible to get incomplete restart files if the model is killed at the wrong time (am unsure if flush still happens in this case), and because we only keep one around (see above), the model will fail. Nothing you can do about this, it will happen every now and again in this configuration if the model restarts. That's still a theory, I have not been able to get enough info from the task logs to confirm.
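
As an illustration of the "flush is only a request" point, here is a minimal Python sketch of a common defensive pattern: write to a temporary file, flush and fsync it, then rename it over the old file. This is not how the model writes its restart files (that code is not shown in this thread); the function name and file names are hypothetical.

```python
import os
import tempfile


def write_restart_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a reader sees the old or the new file.

    Illustrative only: a general pattern, not the model's actual code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".restart-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to commit the data to storage
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a process killed mid-write leaves the previous file untouched rather than a truncated one. It does not protect against every failure (a power cut can still lose the rename unless the directory is synced too).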

Last, do not try to oversubscribe tasks in terms of available memory: you are asking for trouble and lousy throughput. Any swapping will slow everything down and risk crashing processes, depending on the memory pressure. A crash is most likely when the model hits the high-water memory part of the code, and I can spot those cases by looking at the traceback to see where it failed. As I look at the fails, it would avoid wasting my time if people don't try this!
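
A rough way to check for oversubscription is to divide usable RAM by the per-task memory bound. The sketch below assumes you look up rsc_memory_bound yourself and convert it to GB; the function name, the reserve for the operating system and the example numbers are all hypothetical.

```python
def max_concurrent_tasks(total_ram_gb, memory_bound_gb, reserved_gb=4.0, cpu_threads=None):
    """Return how many tasks fit in RAM without relying on swap.

    total_ram_gb    -- physical RAM in the host
    memory_bound_gb -- per-task memory bound (rsc_memory_bound, converted to GB)
    reserved_gb     -- RAM kept back for the OS and other programs (assumed value)
    cpu_threads     -- optional cap on the number of hardware threads
    """
    if memory_bound_gb <= 0:
        raise ValueError("memory_bound_gb must be positive")
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    by_memory = int(usable_gb // memory_bound_gb)
    if cpu_threads is not None:
        return min(by_memory, cpu_threads)
    return by_memory


if __name__ == "__main__":
    # Hypothetical host: 256 GB RAM, 256 threads, 5 GB per task, 8 GB reserved.
    print(max_concurrent_tasks(256, 5.0, reserved_gb=8.0, cpu_threads=256))
```

On a host like the hypothetical one above, memory rather than thread count sets the limit, which is the situation described in this thread.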

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67220 - Posted: 2 Jan 2023, 15:19:16 UTC

I've been looking at these fails. Does this limit refer to the total disk usage under '../projects/climateprediction.net' plus '../slots/0' or is it specifically for the task allocated space in '../slots/0/'? Apologies if this has been asked before.


I have just repeated what someone explained to me: that there is an indirect 100 GB limit, meaning a default value was set in one of the EditBoxes in the UI. If you leave a non-zero, non-empty value there, then BOINC uses that default value.
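
To show how such a limit could bite, here is a small sketch that assumes BOINC applies the most restrictive of its three disk preferences (an absolute GB cap, a percentage of the disk, and a minimum amount to leave free), and that 0 in the absolute cap means "no limit". The parameter names mirror the preference names disk_max_used_gb, disk_max_used_pct and disk_min_free_gb; the function itself and the example numbers are hypothetical.

```python
def allowed_disk_usage_gb(disk_total_gb, disk_free_gb, boinc_used_gb,
                          disk_max_used_gb, disk_max_used_pct, disk_min_free_gb):
    """Return the disk space BOINC may use, taking the most restrictive limit.

    ASSUMPTION: the three preferences combine as a simple minimum and
    disk_max_used_gb == 0 disables the absolute cap.
    """
    limits = [
        # At most this share of the whole disk.
        disk_total_gb * disk_max_used_pct / 100.0,
        # Keep at least disk_min_free_gb free after BOINC's own usage.
        boinc_used_gb + disk_free_gb - disk_min_free_gb,
    ]
    if disk_max_used_gb > 0:
        limits.append(disk_max_used_gb)  # the absolute cap, e.g. 100 GB
    return max(min(limits), 0.0)


if __name__ == "__main__":
    # Hypothetical 1 TB disk with 500 GB free and 90 GB already used by BOINC.
    print(allowed_disk_usage_gb(1000, 500, 90, 100, 90, 1))  # capped by the 100 GB value
    print(allowed_disk_usage_gb(1000, 500, 90, 0, 90, 1))    # cap disabled
```

In the first call the 100 GB value is what limits BOINC even though far more space is free, which matches the behaviour described above.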

In terms of Slots, I only know what I just read at:

https://boinc.berkeley.edu/trac/wiki/BoincFiles#:~:text=The%20slot%20directory%20contains%20links,file%20in%20the%20project%20directory).

Which doesn't really explain much apart from them being XML files with paths in them.

do not try to oversubscribe tasks in terms of available memory


Now I am aware of rsc_memory_bound I can see what value is required not to 'over subscribe' in terms of available memory.

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67224 - Posted: 2 Jan 2023, 17:29:05 UTC

Because I can, let's look at the figures:



(figures taken from a Windows BOINC Manager, remotely viewing a Linux cruncher)

I think that the 100 GB limit, because it's a BOINC limit, will be the first of those: "used by BOINC".

On that machine, I can clearly recover nearly 6 GB by simply resetting Einstein and GPUGrid (they're both idle at the moment).

It's not worth thinking about the tiny amount "not available to BOINC" - though it's more than twice the size of my first hard disk.

I'm surprised by the amount "used by other programs". I suspect it may be because I recently did an in-situ upgrade from Mint 20.3 to Mint 21: I suspect there are a lot of optional reversal tools hanging around for safety. I must go and read chapter 2 of the 'upgrading in situ' manual.

Glenn Carver

Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 67225 - Posted: 2 Jan 2023, 18:28:12 UTC - in response to Message 67224.  

I think that the 100 GB limit, because it's a BOINC limit, will be the first of those: "used by BOINC".
On that machine, I can clearly recover nearly 6 GB by simply resetting Einstein and GPUGrid (they're both idle at the moment).
Ah right. It's a total size for everything boinc (i.e. under /var/lib/boinc or wherever it is). And if you hit that limit, we'll see the disk limit exceeded error.

I was curious because one of the changes we made was to correct the task disk size required (rsc_disk_bound), which was previously set 10x too high. Then with these latest batches we see an awful lot more disk limit exceeded errors. So I think it's related, but now I realise there's a lot less room for CPDN to use if people have the 100 GB limit implicitly enabled.
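
To see where the space under the BOINC data directory is going, a short script can total each top-level subdirectory (projects, slots and so on). This is a minimal sketch; the function names are hypothetical, and the path /var/lib/boinc is the one mentioned above and may differ on your system.

```python
import os


def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # do not follow symbolic links
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def usage_by_subdir(boinc_dir):
    """Map each top-level subdirectory of `boinc_dir` to its size in bytes."""
    usage = {}
    for entry in sorted(os.listdir(boinc_dir)):
        full = os.path.join(boinc_dir, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            usage[entry] = dir_size_bytes(full)
    return usage


if __name__ == "__main__":
    # Path as mentioned in the post above; adjust for your installation.
    for name, size in usage_by_subdir("/var/lib/boinc").items():
        print(f"{name}: {size / 1e9:.2f} GB")
```

The totals are apparent file sizes, so they can differ slightly from what the BOINC Manager or `du` reports.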

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,691,690
RAC: 10,582
Message 67226 - Posted: 2 Jan 2023, 18:39:48 UTC - in response to Message 67225.  

An hour, and some heavy housekeeping, later: "free, available to BOINC" has gone up to 76.2 GB. But disk usage by CPDN has gone up to 147.44 GB. I'm going to have to keep an eye on this.

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67227 - Posted: 2 Jan 2023, 21:31:32 UTC - in response to Message 67224.  
Last modified: 2 Jan 2023, 21:34:04 UTC

I'm surprised by the amount "used by other programs". 267.81GB


That is a surprise to me too. My machine uses 3.44 GB for that.

I give an entire partition to Boinc.
68.36 GB used by Boinc.
351.64 GB available to Boinc.
64.60 GB free, not available to Boinc. (but I could allocate it to Boinc)

67.88GB to CPDN. Won't increase because I cannot download any more tasks.
   Mon 02 Jan 2023 08:23:21 AM EST | climateprediction.net | Not requesting tasks: too many uploads in progress

This makes sense, but since I have lots more room, I do not see why they limit me. (This is not a request for any change. I was going to set No More Tasks anyway; I have 42 tasks to upload.)

243.61 MB to WCG. They are usually out of work these days.
193.23 MB to Einstein
4 KB to Rosetta: I reset the project, keeping only app_config
25.08 MB to Milky Way
10.41 MB to Universe

xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67266 - Posted: 3 Jan 2023, 22:33:24 UTC - in response to Message 67209.  
Last modified: 3 Jan 2023, 22:43:14 UTC

["You have reached your quota of 1 task per day"]
Richard Haselgrove wrote:
ncoded.com wrote:
Can anything be done to remove this quota?
1) you can persuade a project administrator to delve into the project's master database, locate the record for your computer, and change the figure back to something bigger, and let the healing process continue from there.
2) or you can wait it out, and make sure those single tasks per day are ready to upload and report when the project is ready for them. The more you can report on that day, the more your quota will increase - but it may take several days to reach 256.
3) You can set up a new boinc client instance. (For details, some team forums have recipes posted for the setup of multiple client instances on a single physical host. A good understanding of client configuration and control is required, blind reliance on copy+paste from these recipes is unlikely to go well.)

wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,550,769
RAC: 58,365
Message 67273 - Posted: 4 Jan 2023, 4:18:32 UTC - in response to Message 67208.  

However I doubt we were the only ones that ran with less memory /task than rsc_memory_bound, so I guess you might see a pretty high failure rate across the board (on large core hosts).

You aren't the only one. It's rare to have 4 GB per core these days, especially on hosts with many cores, since memory is very expensive and no one throws away that amount of money without a good reason. However, a lot of us (based on what I saw in forum discussions), me included, have an app_config that restricts the number of OpenIFS tasks that can run concurrently so we don't run out of memory (OOM) and get the task killed. An OOM kill could be another reason for your failures. Luckily, you can easily verify this by checking the kernel log on your host (dmesg), as every OOM kill dumps a large amount of trace and stats there, complaining about being out of memory.
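
Both ideas can be sketched in a few lines of Python. The first function writes a minimal app_config.xml with a max_concurrent limit; the second scans kernel log text (for example the saved output of dmesg) for typical OOM-killer messages. The application name used here is a placeholder, not the real OpenIFS app name, and the search patterns are an assumption about common Linux kernel wording.

```python
import re
from xml.sax.saxutils import escape


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting concurrent tasks for one app."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        f"    <max_concurrent>{int(max_concurrent)}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# ASSUMPTION: phrases commonly seen in Linux OOM-killer messages.
OOM_PATTERN = re.compile(r"out of memory|oom-kill|invoked oom-killer", re.IGNORECASE)


def find_oom_lines(kernel_log):
    """Return the lines of `kernel_log` that look like OOM-killer activity."""
    return [line for line in kernel_log.splitlines() if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # "example_openifs_app" is a placeholder; use the app name your client reports.
    print(build_app_config("example_openifs_app", 4))
```

The generated file belongs in the project's directory under the BOINC data directory, and the client has to re-read its config files (or be restarted) before the limit takes effect.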

ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 53,400,150
RAC: 3,821
Message 67275 - Posted: 4 Jan 2023, 7:35:55 UTC

Okay thanks wujj123456

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4535
Credit: 18,971,712
RAC: 21,921
Message 67276 - Posted: 4 Jan 2023, 7:55:30 UTC

3) You can set up a new boinc client instance. (For details, some team forums have recipes posted for the setup of multiple client instances on a single physical host. A good understanding of client configurat
Easiest way imho is to set up a VM of some sort which then gets seen as a different computer. I have looked at multiple clients on one host recipes and decided not to go down that route.