Message boards : Number crunching : Tasks failing on Ubuntu 22
Author | Message |
---|---|
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
Hi. I had two tasks running on my shiny new Ubuntu 22 system. Knowing about the problems with 64-bits and the need to install additional stuff I first installed Ubuntu 18, put the additional stuff for CPDN (and other projects) on it, then upgraded to Ubuntu 20, then to Ubuntu 22. The tasks seem to run for their full length, but they return as faulty. I'd like to know if there's anything in the machine's stderr out that might identify problems. Is anybody able to elaborate on that? I run one CPDN task at a time, but have the machine do other Boinc related work on all its threads, but I do so on all my machines all the time, and my other Ryzen handed successful work back to the server. - - - - - - - - - - Greetings, Jens |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,716,561 RAC: 8,355 |
<message> Disk usage limit exceeded </message> would be a good place to start. |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
Ouch! Thanks for getting me back to earth. So I presume you think it's a reasonable thing to start reading western texts from the top? ;-) I looked at the end because the task did lots of things, and it was just confusing to me. And, looking at the beginning and the numbers above the stderr out, I don't really understand how it couldn't have enough disk space. Boinc was allowed to use 100GB or up to half of the disk (whichever is reached first), and there's 800GB of unused disk space on a 1TB device. I have now set this to 200GB, but I still fail to understand this. - - - - - - - - - - Greetings, Jens |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,979,489 RAC: 68,046 |
What you are exceeding is the per-task disk usage limit. The result you linked had peak disk usage above the 7GB limit configured for the task. This thread is likely relevant if that computer frequently suspends and resumes tasks for whatever reason. |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
I already applied the setting to keep tasks in memory while suspended. How does the usage limit for tasks work? Can I adjust it? - - - - - - - - - - Greetings, Jens |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Hi Jens, Is 'leave non-GPU tasks in memory while suspended' unchecked (blank) under Disk & memory in the boincmgr computing preferences? I suspect it is. Please enable it. I can tell from your log that the model is constantly restarting (lots of STEP ... lines, then a whole bunch of startup messages, repeatedly). This is happening because the boinc client suspends the task and, because you don't have the option above on, it kicks the task out of memory, probably because another OS process wants the memory, effectively killing the process (boinc runs all tasks at nice level 19, the lowest priority, so getting kicked out of memory is quite likely). The model then has to restart from its checkpoint files. Now, that would all be ok except that every time the model restarts it keeps its old checkpoint files around just as a backup. The more the model restarts, the more of these checkpoint files accumulate until you hit the task limit. I am going to change this model behaviour but I can't do it for these batches. For now, unless it causes you a problem, please enable 'leave non-GPU in memory'. That will solve it. We've seen this happen a lot, unfortunately. If you can't enable this option for any reason, let me know. Another way to fix it would be to change 'usage limits' to 'Use at most 100% of CPU'. That will keep the model running all the time, but will of course affect all boinc tasks running, so you might not want to solve it that way. Regarding 32bit libraries, the OpenIFS tasks are all 64bit and do not need any further OS libraries installed. Best, Glenn |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
I already applied the setting to keep tasks in memory while suspended. [quote]For now, unless it causes you a problem, please enable 'leave non-GPU in memory'. That will solve it. Regarding 32bit libraries, the OpenIFS tasks are all 64bit, and do not need any further OS libraries installed.[/quote] I already had this setting enabled after commenting in the corresponding thread on 2023-01-01, so for the second task it should already have been applied. The machine didn't even use a quarter of its memory, although it is allowed to use up to three quarters. It could have taken another 32GB to do whatever, if needed. I may have kicked the second task out of memory once by reducing the number of threads I gave to Boinc, but that's nowhere near restarting all the time. Strange. Nice that the 32bit stuff isn't needed for this application; one potential issue less. I think it was for HadAM/HadSM. My second Ubuntu system didn't even get one task right before I installed what I found in the forums. Well, for the moment I don't think I should do any more fiddling with the system, but I'll keep my eyes on this, and I have given more resources to CPDN to keep other tasks from kicking it out of the way because Boinc thinks they're more important. Thanks a lot for everyone's suggestions and ideas so far! - - - - - - - - - - Greetings, Jens |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,979,489 RAC: 68,046 |
Read the log in more detail and I think you might be able to figure it out even now. There are multiple lines of this message in your log, but not in any of my WUs that were never paused: "Quit request received from BOINC client, ending the child process". The timestamp right before that line should let you jump to the right point in the boinc logs to understand why boinc decided to pause the task. For example, I would run this for the first pause around 15:20: journalctl -u boinc-client --since "2023-01-02 15:20:03" This would not explain why "leave non-GPU tasks in memory while suspended" is not effective, though. The assumption is that if that's checked, pausing shouldn't cause any problems. Should the task even receive the quit request in the first place? |
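As a rough way to see how often the task was told to quit, the message quoted above can be counted in a saved copy of the task's stderr. This is only a sketch: `count_quit_requests` and the file name `stderr.txt` are hypothetical, and it assumes the stderr text has been copied from the result page into a local file.

```shell
# Hypothetical helper: counts how many times the quit message appears.
count_quit_requests() {
    # $1: local file holding a copy of the task's stderr output (assumed)
    grep -c 'Quit request received from BOINC client' "$1"
}

# Usage: count_quit_requests stderr.txt
```

A count well above zero would fit the picture of a task that is repeatedly stopped and restarted.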
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
If you changed the option to "leave tasks in memory" but did not have BOINC re-read the config file, the change may not take effect until the file is read. Restarting BOINC would also read the file. Conan |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,884,880 RAC: 19,188 |
If the options are changed via the BOINC manager, the config files get re-read automatically once you hit Save. If the changes are made directly in the config files themselves, then yes, one must run a command to re-read the relevant file(s). gemini8, for future reference: to run the Hadley 32-bit models (HadAM, etc.), you don't need to start from Ubuntu 18, upgrade to 20, and then 22. I'm not sure it'll even work. Just do a clean install of the version you want and then install the needed 32-bit libraries. The very simple instructions for that for different Linuxes and versions are listed at the top of the Unix/Linux section of the forum. This way, not only will it be less time consuming, but you'll also have a cleaner system. One potential reason for your issue could be that there's a lot of task swapping going on, i.e. BOINC works on a group of tasks for a short while and then switches to another group and then perhaps yet another before coming back to the original one. A suggestion would be to set the "Switch between tasks every __ minutes" setting to a very high number like 10080 minutes (1 week). Task swapping is inefficient and can clog up a lot of memory, especially if "Leave non-GPU tasks in memory while suspended" is on. Regardless of how long they take, let one group of tasks finish before starting another. |
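For anyone editing the files by hand, a quick check could look like the sketch below. It rests on several assumptions: that local preferences live in `global_prefs_override.xml` in the BOINC data directory, that the option is stored as `<leave_apps_in_memory>1</leave_apps_in_memory>`, and that `leave_in_memory_enabled` is just a hypothetical helper name. The re-read command mentioned above would then be `boinccmd --read_global_prefs_override`.

```shell
# Hypothetical helper: succeeds if the override file switches the option on.
leave_in_memory_enabled() {
    # $1: path to global_prefs_override.xml (location depends on the install)
    grep -q '<leave_apps_in_memory>1</leave_apps_in_memory>' "$1"
}

# Usage (path is an assumption for a packaged Ubuntu install):
#   leave_in_memory_enabled /var/lib/boinc-client/global_prefs_override.xml && echo enabled
# After editing the file by hand, ask the client to re-read it:
#   boinccmd --read_global_prefs_override
```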
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"Read the log in more detail and I think you might be able to figure out even now. There are multiple lines of this message in your log, but not in any of my WUs that were never paused." It's quite hard for the task to know why the client told it to quit. The most likely cause is that 'leave non-gpu in memory' is not enabled and the client needs the task to suspend (because it's not allowed to use 100% cpu). Even if that option is enabled, I believe the task can still be kicked out of memory if the OS decides it needs the memory for something else. Were you doing anything at the time that was particularly memory hungry? All the boinc tasks run with the lowest system priority (19) and will be the first to go if some other process needs the RAM. I might be wrong, but I don't think the client can tell the OS to keep the processes in memory at the expense of all else. "The timestamp right before that line should let you to jump to the right point of boinc logs to understand why boinc decided to pause the task. For example, I would run this for the first pause around 15:20" I'd suggest double checking that the option is still on after closing and reopening boincmgr, just in case something weird is going on. Maybe enabling that option doesn't affect currently running tasks, only ones started after the option change? |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,306,092 RAC: 230 |
Glenn Carver wrote: "All the boinc tasks run with the lowest system priority (19) and will be the first to go if some other process needs the RAM. I might be wrong but I don't think the client can tell the OS to keep the processes in memory at the expense of all else." Correct. The boinc client cannot do this. The application itself could do it (via mlock() or mmap(), which require the caller to hold certain privileges), but this should be reserved for applications with realtime requirements, not used by mere bulk processing applications, and it certainly shouldn't be done (and may not succeed) with large memory regions. Anyway. If the kernel's process scheduler preempts a CPDN task, then that's just like a suspend-to-RAM and possibly page-out-RAM-to-a-swap-device. In contrast, if the boinc client requested a CPDN task to suspend-to-disk, then the task would have to write its checkpoint data. Glenn Carver wrote: "I'd suggest double checking the option is still on after closing & opening boincmgr, just in case something weird is going on. Maybe enabling that option doesn't affect currently running tasks and only ones started after the option change?" At least all boinc client versions which I have been using so far applied this option change (in whichever direction, on or off) immediately to all currently running tasks. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"Anyway. If the kernel's process scheduler preempts a CPDN task, then that's just like a suspend-to-RAM and possibly page-out-RAM-to-a-swap-device. In contrast, if the boinc client requested a CPDN task to suspend-to-disk, then the task would have to write its checkpoint data." Swapping out won't cause the model to terminate though (unless there's not enough swap). The boinc client has to send either a quit or abort request to the controlling wrapper, which then sends a SIGKILL to the model process. If the client sends a suspend request, the model process gets a SIGSTOP; it does not interpret signals like this, so it won't write a checkpoint restart file. Instead the model has its own internal mechanism for periodically writing checkpoint restart files. |
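The difference between the two signals can be tried out on a throwaway process. The sketch below is Linux-only (it reads the process state from /proc) and uses `sleep` as a stand-in for the model process; `demo_stop_vs_kill` is a hypothetical name, and nothing here is taken from the actual wrapper code.

```shell
# Linux-only sketch; "sleep 30" stands in for the model process.
demo_stop_vs_kill() {
    sleep 30 &
    pid=$!
    kill -STOP "$pid"        # what a suspend request leads to, per the post above
    sleep 1                  # give the kernel a moment to stop the process
    state=$(awk '{print $3}' "/proc/$pid/stat")
    echo "after SIGSTOP: state=$state"    # T = stopped, but still alive in memory
    kill -KILL "$pid"        # what a quit or abort request leads to
    status=0
    wait "$pid" 2>/dev/null || status=$?
    echo "after SIGKILL: exit status=$status"   # 128 + 9 (signal number of SIGKILL)
}

# Usage: demo_stop_vs_kill
```

A stopped process keeps its memory and can be continued later, whereas a killed one is gone and has to start again from whatever was last written to disk.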
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
Thanks for your idea! Just to make it clear again: This machine hasn't seen memory usage exceeding 30% of the 64GB while swap is at 0%. RAM isn't an issue here. Leave non-GPU tasks in memory has been ticked between the first and second task. Also, I had a look at the log file, and the only thing I find is: 02-Jan-2023 17:10:34 [climateprediction.net] Aborting task oifs_43r3_ps_0094_2007050100_123_976_12192738_0: exceeded disk limit: 7590.70MB > 7168.00MB with nothing unusual preceding it. ATM I'm waiting for the third task to finish in hope it will be ok. - - - - - - - - - - Greetings, Jens |
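The two numbers in that log line can be put side by side with a bit of arithmetic. This is only an illustrative sketch; `disk_overage` is a hypothetical helper, and the values are copied from the log line quoted above.

```shell
# Hypothetical helper: compares peak usage with the per-task limit.
disk_overage() {
    # $1: peak disk usage in MB, $2: per-task limit in MB
    awk -v used="$1" -v limit="$2" \
        'BEGIN { printf "limit = %.2f GB, overage = %.2f MB\n", limit / 1024, used - limit }'
}

disk_overage 7590.70 7168.00
# prints: limit = 7.00 GB, overage = 422.70 MB
```

So the 7168.00MB in the log is the 7GB limit mentioned earlier in the thread, and the task went over it by roughly 423MB.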
Send message Joined: 16 Jun 05 Posts: 16 Credit: 19,492,951 RAC: 10,279 |
"Thanks for your idea!" Have you looked at the BOINC Manager DISK pie chart screen? It shows Disk Usage pie charts and you can visually see how much space you have left on the disk partition. You can increase the DISK available with the OPTIONS : COMPUTING PREFERENCES popup. Look at the DISK AND MEMORY options. I think I remember some problems with leaving any of the 3 disk options blank. Try setting those 3 values and check the DISK USAGE pie charts. |
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
Storage and memory usage: Use no more than 200 GB; Leave at least 5 GB free; Use no more than 50 % of total disk space. The first option was at 100 GB when I did those two tasks. Of my 1TB SSD, about 750 GB are free. - - - - - - - - - - Greetings, Jens |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,306,092 RAC: 230 |
@rjs5, there are two types of disk limits: – What you refer to is the global limit for everything which happens in the boinc client, all projects and all tasks summed up. – Independent of that, there is an individual limit for each task. This one is the one which caused the failure according to @gemini8's log line. The per-task limit is controlled by the project admin, not by the user. (Unless the user performs certain modifications in the client's state file, which is not intended in boinc client's design.) |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
Hi Jens, I understand your points. Could you please check something for me? Go into the /var/lib/boinc/slots directory (or wherever your boinc is installed), and run this command (it must be run from 'slots'): ls -l ?/srf* and let me know what output you get. The srf files are the biggest of the model restart files. For example, I have: $ cd /var/lib/boinc/slots/; ls -l 0/srf* which prints: -rw-r--r-- 1 boinc boinc 804992476 Jan 6 00:52 0/srf00260000.0001 and -rw-r--r-- 1 boinc boinc 804992476 Jan 6 10:13 0/srf00330000.0001 There should be one extra file for every restart the model has done. The number after 'srf' is the model step count when the file was written. If the task is running at 100% cpu as set in Computing preferences in boincmgr, you should only have 1 srf file. In my example above, the model has restarted once as I shut down my PC at night. If you have a lot of these files per slot directory, then the model is restarting often and we need to understand why. To save space (if needed) you can safely delete the older srf files, but always leave the most recent file, otherwise the model will not be able to restart at all. In the example above, I could safely delete srf00260000 as that has the lowest number and the oldest date, but I must leave the srf00330000 file. Cheers, Glenn |
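The "older files" rule described above can be sketched as a small helper that only lists candidates and deletes nothing. It assumes the step counter in the file name is zero-padded to a fixed width, as in the two example names, so that sorting by name matches sorting by step; `old_srf_files` is a hypothetical name.

```shell
# Hypothetical helper: prints every srf file in a slot except the newest one.
old_srf_files() {
    # $1: one slot directory, e.g. /var/lib/boinc/slots/0
    # Sort by name (equivalent to sorting by step count if zero-padded),
    # then drop the last line, which is the most recent restart file.
    ls "$1"/srf* 2>/dev/null | sort | sed '$d'
}

# Usage: old_srf_files /var/lib/boinc/slots/0
```

With the two files from the example, this would print only the srf00260000.0001 path. If a slot holds a single srf file, nothing is printed.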
Send message Joined: 4 Dec 15 Posts: 52 Credit: 2,488,619 RAC: 2,087 |
"ls -l ?/srf* and let me know what output you get." $ sudo ls -l ?/srf* ls: Zugriff auf '?/srf*' nicht möglich: Datei oder Verzeichnis nicht gefunden, which means: cannot access '?/srf*': No such file or directory. *edit* Replacing the ? with any slot number that ls shows me gets me nothing either. *end edit* - - - - - - - - - - Greetings, Jens |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,741,514 RAC: 87,063 |
Try sudo ls -l ?/ | grep srf |
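One possible explanation for the earlier error, offered as an assumption rather than a diagnosis: in `sudo ls -l ?/srf*` the pattern is expanded by the unprivileged login shell before sudo even starts, so if that user is not allowed to read the slot directories, the pattern is passed through literally and ls reports that it cannot find it. Letting a root shell do the expansion avoids this. `list_srf_as_root` and the `SUDO` variable below are hypothetical names.

```shell
# Hypothetical helper; SUDO defaults to "sudo" and can be set empty for a dry run.
list_srf_as_root() {
    # $1: slots directory, e.g. /var/lib/boinc/slots (path as given in the thread)
    # The single quotes keep the calling shell from expanding ?/srf*;
    # the inner sh, running with root rights, expands the pattern instead.
    ${SUDO-sudo} sh -c 'cd "$1" && ls -l ?/srf*' sh "$1"
}

# Usage: list_srf_as_root /var/lib/boinc/slots
```

If this still reports nothing, the slots may simply contain no srf files at the moment, for example because no OpenIFS task is currently running.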
©2024 cpdn.org