Message boards : Number crunching : Best Swap file size for CPDN?
Author | Message |
---|---|
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
The server is now running at 30% load (6+ GB per task), so there should be no reason for any further errors (from my end). And just to be clear, because you are trying to imply we are just wasting power: we have done 667 successful tasks in the last few months (on this server), with the majority done in the last few weeks. And at 7000 W for 256 threads, the premise that we are the ones who are inefficient is frankly absurd. I am not psychic; if I can run at 50-99% load on every project on BOINC without getting a 50% failure rate, then I of course (like most) would assume this is the case at CPDN. I already stated that the 1 task quota was related to the errors; I never mentioned network connections. Thank you for all the advice and help, especially about rsc_memory_bound from Glenn. We have everything we need now. We won't be running VirtualBox, or upgrading RAM anytime soon. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
No, that wasn't my point at all. If you have done good work in the past, and if your other machines are producing good work now (I didn't check those), that's all good. But you posted about one specific machine, and about one specific problem. I missed your post about the reason for that particular sequence of errors: if you know it, and have fixed it, then that's OK too. But for you and the other readers of this thread: a quota of 'one per day' will not start to increase until you can complete and report successful tasks. I've learned over the years not to do anything experimental during a breakdown. Sometimes you can fall down a trapdoor and not be able to claw your way out until things are fixed. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
667 successful tasks on this server, most in the last few weeks. The only thing that changed was that CPDN put out a large number of tasks, which allowed large hosts to get loaded up with tasks; in our case around 60%. However I doubt we were the only ones that ran with less memory per task than rsc_memory_bound, so I guess you might see a pretty high failure rate across the board (on large core hosts). Thank you for your info that you cannot get out of BOINC Jail until you complete AND report tasks; something that no-one can do presently. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
If the recent errors were due to your attempts to increase available disk space by installing an additional drive, fair enough - I've thought about doing that too, if this outage continues to extend. I'm now warned that it's not as easy as it looks. But in the same post, you did ask: "Can anything be done to remove this quota?" Yes, there are two ways. 1) you can persuade a project administrator to delve into the project's master database, locate the record for your computer, and change the figure back to something bigger, and let the healing process continue from there. 2) or you can wait it out, and make sure those single tasks per day are ready to upload and report when the project is ready for them. The more you can report on that day, the more your quota will increase - but it may take several days to reach 256. To be honest, come Tuesday morning, I suspect the project's attention will be fixed on getting the pending results gathered in as quickly as possible: my two 6-core machines each have over 70 waiting already, and I can go on adding to that number almost indefinitely. I'm not going to do anything to risk them. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
"If the recent errors were due to your attempts to increase available disk space by installing an additional drive" Richard, are you saying that it is incorrect to think that these tasks had such a high failure rate because RAM use was above rsc_memory_bound and hence swap was constantly used? That is how I understood it from what Glenn said. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
I haven't followed that part of the conversation in detail. There wasn't any detailed 'reason for failure' in the task I looked at on the web. The only clues are likely to be buried deep in the history of BOINC's event log (stdoutdae.txt on disk in Windows, in the system journal for Linux service installs). But it may depend on which event log flags were active at the time. Edit - my own guess is that the extra time wasted by disk swapping and thrashing at the end of the run might have exceeded some obscure time limit. The tasks had certainly run for an extended period, but with the current state of the server, we can't see exactly how far they'd got. If you can track down a task which uploaded a full stderr.txt to the web, but is still reported as a failure, that might contain some clues. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
I actually did notice that most of the tasks that failed did so just at the very end. "If you can track down a task which uploaded a full stderr.txt to the web, but is still reported as a failure, that might contain some clues." Not sure I want to look through 332 errors. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Actually there were 2 issues. One was the 100 GB limit, which was the cause of some of the earlier errors: 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED https://www.cpdn.org/result.php?resultid=22264124 And I believe the 2nd was when it started not having enough memory and hence started to use the swap: 1 (0x00000001) Unknown error code, stderr: https://gist.github.com/ncoded/c875e9a955252dd2a15540914de2e059 And before anyone asks: no, we didn't click cancel or abort or anything like that in the Client (no matter what it says in stderr). |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Yes, I got caught by that 100 GB limit too. When uploads started failing on Christmas Eve, I knew we were likely to be in for a long haul, and I was due to be away from home for four days. So I cached as much work as I could overnight, and let the machines chew on them at leisure while I was away. On one machine, that worked - it's still working through that stock (but BOINC is beginning to grumble about deadlines). But the other machine failed - I'd inadvertently left the 100 GB limit trigger in place - and that stock ran out yesterday. But I've spent 15 years learning how to drive this thing, and I know what can be tweaked in an emergency. I'm fetching work again, and still crunching, while keeping all my results ready for the scientists. I'm not going to post the technique in public - it could be dangerous if used inappropriately - but it can be done. |
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Well let's hope the upload issue gets sorted shortly so all can get back to crunching. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
Anything above 10% task fails is probably above average and needs some investigating. Ideally, 5% is where you should be. "Actually there were 2 issues." I've been looking at these fails. Does this limit refer to the total disk usage under '../projects/climateprediction.net' plus '../slots/0' or is it specifically for the task allocated space in '../slots/0/'? Apologies if this has been asked before. There is an issue with the restart files, which can accumulate if the model is restarted frequently. I notice from your second link that the log shows the model restarting several times. If the model restarts it will keep its files for that instance until the job ends, as a backup. However, this is a problem if the model restarts frequently, as these files can accumulate (they are ~1 GB) and add to the total space. I am going to change this. "And I believe the 2nd was when it started not having enough memory and hence started to use the swap:" No, the model failed to restart properly because of bad data read from a restart file, not due to lack of memory or swapping (swapping will cause the model to slow down enormously; if no swap is configured it will crash instead, but so will your OS processes). What I think is happening is that sometimes the restart data is not flushed to disk before the task is stopped for any reason. The code issues 'flush to disk', but this is only a request, not a command, to the operating system. So it is possible to get incomplete restart files if the model is killed at the wrong time (I am unsure if the flush still happens in this case), and because we only keep one around (see above), the model will fail. There is nothing you can do about this; it will happen every now and again in this configuration if the model restarts. That's still a theory; I have not been able to get enough info from the task logs to confirm. Last, do not try to oversubscribe tasks in terms of available memory: you are asking for trouble and a lousy throughput. Any swapping will slow everything down and risk crashing processes, depending on the memory pressure. A crash is most likely when the model hits the high-water memory part of the code, and I can spot those by looking at the traceback to see where it failed. As I am the one looking at the fails, it would avoid wasting my time if people didn't try this! |
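The point above, that a flush is only a request and that a task killed at the wrong moment can leave an incomplete restart file, can be illustrated with a common defensive pattern: write to a temporary file, flush and fsync it, then atomically rename it over the old file. This is a minimal Python sketch of that general technique only. It is not the model's actual code, and the function name `write_restart_file` and the file names are hypothetical.

```python
import os
import tempfile


def write_restart_file(path: str, data: bytes) -> None:
    """Replace `path` so a reader sees either the old file or the complete new one.

    Hypothetical helper for illustration; not taken from the CPDN/OpenIFS code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem and can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".restart-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the operating system
            os.fsync(tmp.fileno())  # ask the OS to commit the data to the device
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # On any failure, remove the partial temporary file and keep the old restart file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a process killed before `os.replace` leaves the previous restart file untouched, which is one way to avoid relying on a single, possibly half-written, copy.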
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
"I've been looking at these fails. Does this limit refer to the total disk usage under '../projects/climateprediction.net' plus '../slots/0' or is it specifically for the task allocated space in '../slots/0/'? Apologies if this has been asked before." I have just repeated what someone explained to me: that there is an indirect 100 GB limit, meaning a default value was set on one of the EditBoxes in the UI. And if you leave a non-zero, non-empty value then it uses that default value. In terms of slots, I only know what I just read at: https://boinc.berkeley.edu/trac/wiki/BoincFiles#:~:text=The%20slot%20directory%20contains%20links,file%20in%20the%20project%20directory). Which doesn't really explain much apart from them being XML files with paths in them. "do not try to oversubscribe tasks in terms of available memory" Now I am aware of rsc_memory_bound, I can see what value is required not to 'over subscribe' in terms of available memory. |
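As a rough illustration of the arithmetic behind "not oversubscribing", the sketch below divides usable RAM by a per-task memory bound to get a task cap, and renders that cap as a BOINC `app_config.xml` fragment using the `max_concurrent` element. All numbers, the reserved-memory allowance, and the app name `example_app` are placeholders chosen for illustration; take the real per-task figure from the task's rsc_memory_bound and the real app name from your own client.

```python
from typing import Optional


def max_concurrent_tasks(total_ram_gb: float, per_task_gb: float,
                         reserved_gb: float = 0.0,
                         threads: Optional[int] = None) -> int:
    """Return how many tasks fit in RAM without relying on swap."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    limit = int(usable_gb // per_task_gb)
    if threads is not None:
        # Never schedule more tasks than there are hardware threads.
        limit = min(limit, threads)
    return limit


def render_app_config(app_name: str, max_concurrent: int) -> str:
    """Render an app_config.xml fragment that caps concurrent tasks for one app."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# Illustrative figures only: 512 GB RAM, 16 GB kept back for the OS, 6 GB per task.
cap = max_concurrent_tasks(total_ram_gb=512, per_task_gb=6, reserved_gb=16, threads=256)
print(cap)  # 82
print(render_app_config("example_app", cap))
```

The design choice here is to let memory, not thread count, set the cap: on a many-core host the RAM limit is usually reached long before every thread is busy.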
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
Because I can, let's look at the figures: (figures taken from a Windows BOINC Manager, remotely viewing a Linux cruncher) I think that the 100 GB limit, because it's a BOINC limit, will be the first of those: "used by BOINC". On that machine, I can clearly recover nearly 6 GB by simply resetting Einstein and GPUGrid (they're both idle at the moment). It's not worth thinking about the tiny amount "not available to BOINC" - though it's more than twice the size of my first hard disk. I'm surprised by the amount "used by other programs". I suspect it may be because I recently did an in-situ upgrade from Mint 20.3 to Mint 21: I suspect there are a lot of optional reversal tools hanging around for safety. I must go and read chapter 2 of the 'upgrading in situ' manual. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"I think that the 100 GB limit, because it's a BOINC limit, will be the first of those: 'used by BOINC'." Ah right. It's a total size for everything BOINC (i.e. under /var/lib/boinc or wherever it is). And if you hit that limit, we'll see the disk limit exceeded error. I was curious because one of the changes we made was to correct the task disk size required (rsc_disk_bound), which was previously set 10x too high. Then with these latest batches we see an awful lot more disk limit exceeded errors. So I think it's related, but now I realise there's a lot less room for CPDN to use if people have the 100 GB limit implicitly enabled. |
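To see how close a host is to a total-size limit of this kind, one can sum the sizes of all files under the BOINC data directory and compare the result with the configured figure. The Python sketch below does only that. The path `/var/lib/boinc` is the one mentioned above and may differ on your system, the 1024^3 bytes-per-GB conversion is an assumption, and BOINC's own accounting may not match this simple sum exactly.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all files below `root`, without following symbolic links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                # lstat counts a symbolic link as the link itself, not its target.
                total += os.lstat(full_path).st_size
            except OSError:
                # Files can vanish while tasks are running; skip them.
                continue
    return total


def exceeds_limit(root: str, limit_gb: float, bytes_per_gb: int = 1024 ** 3) -> bool:
    """Return True if the files under `root` add up to more than `limit_gb`.

    `bytes_per_gb` is an assumption; adjust it if your client counts in 10**9.
    """
    return directory_size_bytes(root) > limit_gb * bytes_per_gb


if __name__ == "__main__":
    data_dir = "/var/lib/boinc"  # adjust to your installation
    used_gb = directory_size_bytes(data_dir) / 1024 ** 3
    print(f"{data_dir}: {used_gb:.2f} GB used, over 100 GB: {exceeds_limit(data_dir, 100)}")
```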
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,708,278 RAC: 9,361 |
An hour, and some heavy housekeeping, later: "free, available to BOINC" has gone up to 76.2 GB. But disk usage by CPDN has gone up to 147.44 GB. I'm going to have to keep an eye on this. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I'm surprised by the amount 'used by other programs'." 267.81 GB: that is a surprise to me too. My machine uses 3.44 GB for that. I give an entire partition to Boinc. 68.36 GB used by Boinc. 351.64 GB available to Boinc. 64.60 GB free, not available to Boinc (but I could allocate it to Boinc). 67.88 GB to CPDN. Won't increase because I cannot download any more tasks. Mon 02 Jan 2023 08:23:21 AM EST | climateprediction.net | Not requesting tasks: too many uploads in progress This makes sense, but since I have lots more room, I do not see why they limit me. (This is not a request for any change. I was going to set No More Tasks anyway; I have 42 tasks to upload.) 243.61 MB to WCG. They are usually out of work these days. 193.23 MB to Einstein. 4 KB to Rosetta: I reset the project, keeping only app_config. 25.08 MB to Milky Way. 10.41 MB to Universe. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
["You have reached your quota of 1 task per day"] Richard Haselgrove wrote, in reply to ncoded.com's "Can anything be done to remove this quota?": "1) you can persuade a project administrator to delve into the project's master database, locate the record for your computer, and change the figure back to something bigger, and let the healing process continue from there." 3) You can set up a new boinc client instance. (For details, some team forums have recipes posted for the setup of multiple client instances on a single physical host. A good understanding of client configuration and control is required; blind reliance on copy+paste from these recipes is unlikely to go well.) |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,786,359 RAC: 64,141 |
"However I doubt we were the only ones that ran with less memory /task than rsc_memory_bound, so I guess you might see a pretty high failure rate across the board (on large core hosts)." You aren't the only one. It's rare to have 4 GB per core these days, especially on hosts with many cores, since memory is very expensive and no one just throws away that amount of money without a good reason. However, a lot of us (based on what I saw in forum discussions), me included, have an app_config that restricts the number of OpenIFS tasks that can run concurrently, so we don't run out of memory (OOM) and get the task killed. An OOM kill could be another reason for your failures. Luckily, you can easily verify this by checking the kernel log on your host (dmesg), as every OOM kill dumps a huge amount of trace and stats there, yelling "out of memory". |
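The kernel-log check suggested above can be partly automated. The Python sketch below scans log lines (for example, the saved output of `dmesg`) for the "Killed process <pid> (<name>)" text that the Linux OOM killer writes. The exact wording of these messages varies between kernel versions, so the pattern is an assumption to adapt, and the sample lines and the process name `example_model` are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches the part of an OOM-killer line that names the victim, e.g.
# "Out of memory: Killed process 4242 (example_model) total-vm:..."
KILLED_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the log lines."""
    kills = []
    for line in lines:
        match = KILLED_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative sample lines, not taken from a real host.
sample_log = [
    "[12345.678901] Out of memory: Killed process 4242 (example_model) "
    "total-vm:5000000kB, anon-rss:4000000kB",
    "[12346.000000] usb 1-1: new high-speed USB device number 2",
]
print(find_oom_kills(sample_log))  # [(4242, 'example_model')]
```

To use it on a real host, save the kernel log to a file (reading it may need elevated privileges on some distributions) and pass the file's lines to `find_oom_kills`; an empty result suggests the failures have another cause.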
Send message Joined: 16 Aug 16 Posts: 73 Credit: 53,400,150 RAC: 3,821 |
Okay thanks wujj123456 |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,022,240 RAC: 20,762 |
"3) You can set up a new boinc client instance. (For details, some team forums have recipes posted for the setup of multiple client instances on a single physical host. A good understanding of client configurat..." Easiest way imho is to set up a VM of some sort, which then gets seen as a different computer. I have looked at multiple-clients-on-one-host recipes and decided not to go down that route. |
©2024 cpdn.org