w/u failed at the 89th zip file
Author | Message |
---|---|
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error "double free or corruption (out)" Anybody had one of these? Just curious what it might mean?? Ta Nairb |
Joined: 15 May 09 Posts: 4538 Credit: 19,004,017 RAC: 21,574 |
nairb wrote: "This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error" Don't have time to search it out right now, but there is something about this in the OIFS discussion thread. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
nairb, A couple of things to consider, in order of priority: 1) How many concurrent tasks are you running? With the amount of RAM that you have, running more than 3 at a time will likely lead to a high failure rate. Two may not be a bad idea if your PC is heavily used for other things. 2) Do you have "Leave non-GPU tasks in memory while suspended" enabled in Computing preferences? It's highly recommended, especially if the tasks often get interrupted for any reason like task swapping, BOINC/PC restarts. |
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
AndreyOR wrote: "2) Do you have 'Leave non-GPU tasks in memory while suspended' enabled in Computing preferences? It's highly recommended, especially if the tasks often get interrupted for any reason like task swapping, BOINC/PC restarts." Yes, it's ticked. It's a dedicated machine and I try to ensure that once a climate task starts it's not suspended by other work and runs to completion. When I checked the machine today, it seemed to be running almost idle, with 2 tasks using almost no CPU time. It needed a hard reboot. I should have done a memory check, but it looks to have come back to life, although it has dumped 2 of the working w/u's with computation errors. When it has cleared the running jobs I will run a memory checker just to be sure. I do tend to load the thing with 4 climate jobs and 4 WCG jobs at once... it's done OK so far. But maybe I have been lucky. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
nairb wrote: "Yes, it's ticked. It's a dedicated machine and I try to ensure that once a climate task starts it's not suspended by other work and runs to completion." It looks like you have the 2nd point taken care of, which is good. As for the first one... Looking at your history of OIFS tasks, the failure rate is ~26% (10/39), which is high. Glenn said that the goal he'd like to reach is under 5%. High failure rate is bad for the project (no result to scientist) and the user (wasted CPU time and power). I don't think I've done any WCG CPU work since they shut down for the move, so I don't know how demanding their tasks are, especially for RAM. Regardless, it'd seem reasonable to me to go down to 3 OIFS tasks and let things run like that for a good week and see what the failure rate is for that time period. If it's above 5%, go down to 2 or reduce other high RAM usage of that PC. Keep trying until you find a workload that produces under 5% failure rate. I believe I found the sweet spot for my 2 PCs but I still have to do some observation of one of them to make sure. |
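For anyone wanting to track this over a week of results, the arithmetic behind the ~26% figure can be sketched in a few lines of Python. This is only an illustration: the names `failure_rate`, `within_target` and `TARGET` are made up here, and the numbers (10 failures out of 39 tasks, a goal of under 5%) are the ones quoted in the post above.

```python
# Illustrative sketch only: the function and constant names are hypothetical,
# the numbers are the ones quoted in the post above.
TARGET = 0.05  # the "under 5%" goal mentioned in the thread


def failure_rate(failed: int, total: int) -> float:
    """Return the fraction of tasks that failed (0.0 if no tasks ran)."""
    if total <= 0:
        return 0.0
    return failed / total


def within_target(failed: int, total: int, target: float = TARGET) -> bool:
    """True when the observed failure rate is below the target."""
    return failure_rate(failed, total) < target


if __name__ == "__main__":
    rate = failure_rate(10, 39)
    print(f"failure rate: {rate:.1%}")
    print("within target:", within_target(10, 39))
```

Re-running the same calculation after each change in the number of concurrent tasks gives a simple way to compare workloads, provided enough tasks have completed for the percentage to mean something.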
Joined: 4 Dec 15 Posts: 52 Credit: 2,478,679 RAC: 1,748 |
I second the notion of decreasing the memory load. With 'just' 24 gigs of RAM I'd reduce CPDN to three tasks or drop WCG altogether for the moment. - - - - - - - - - - Greetings, Jens |
Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
"double free or corruption (out)" means: – Either the program attempted to free (i.e., to deallocate) a memory segment more than once. – Or something illegally overwrote certain data right before the memory segment which was to be freed. Could be a programming error. Or could be a secondary symptom of some earlier program failure. (Edit: Can also be caused by hardware defects, e.g. overclocked RAM or CPU, but IIRC this failure mode has also been seen on Xeon hosts which seem unlikely to be operated in an unstable manner or with defects.) nairb wrote: "Anybody had one of these?" There are several more reports of this in this message board, and occurrences in stderr.txt of several failed tasks of users who mentioned that they had failures. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
AndreyOR wrote: "As for the first one... Looking at your history of OIFS tasks, the failure rate is ~26% (10/39), which is high. Glenn said that the goal he'd like to reach is under 5%. High failure rate is bad for the project (no result to scientist) and the user (wasted CPU time and power). I don't think I've done any WCG CPU work since they shut down for the move so I don't know how demanding their tasks are, especially for RAM. Regardless, it'd seem reasonable to me to go down to 3 OIFS tasks and let things run like that for a good week and see what the failure rate is for that time period. If it's above 5%, go down to 2 or reduce other high RAM usage of that PC. Keep trying until you find a workload that produces under 5% failure rate." I have done WCG CPU work since their shutdown. Right now I am running 5 OIFS tasks, 4 WCG CPU tasks, 1 Einstein task, and 2 MilkyWay tasks on my machine, and that is not giving me any errors. On the CPDN web site, most of my completed tasks have received credit, but I have not received any credit on the Statistics tab in a very long time. BoincStats seems to know, but the Projects tab does not know either. My machine is pretty much like this: Computer: 1511241; Total credit: 6,749,805; Average credit: 50.54; CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]; BOINC version: 7.20.2; Memory: 62.4 GB; Cache: 16896 KB; Swap space: 15.62 GB; Total disk space: 488.04 GB; Free disk space: 475.26 GB; Measured floating point speed: 6.05 billion ops/sec; Measured integer speed: 24.32 billion ops/sec; Average upload rate: 4801.6 KB/sec; Average download rate: 6929.83 KB/sec; Average turnaround time: 2.68 days |
Joined: 6 Aug 04 Posts: 195 Credit: 28,332,519 RAC: 10,361 |
nairb wrote: "This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error" I've had a couple of these errors on an Ubuntu VM. The advice from Glenn was to allow 5GB per OpenIFS task and to run at most n-1 OpenIFS tasks, where n is the number of physical CPU cores. To avoid any unexpected compatibility issues I also dropped running any other projects. With a 4-core i7 CPU and 24GB memory that would be 3 concurrent OpenIFS tasks, leaving memory headroom and one physical CPU core for whatever else the computer system wants to run. |
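The rule of thumb in the post above (5GB per task, at most n-1 physical cores) can be written down as a small calculation. This is a sketch under assumptions: the function name `max_oifs_tasks` is invented here, and combining the two limits by taking the smaller one is my reading of the advice, not something stated by the project.

```python
# Sketch of the rule of thumb quoted above. The function name is hypothetical,
# and taking the minimum of the two limits is an assumption about how the
# CPU and RAM advice are meant to combine.
GB_PER_TASK = 5  # memory allowance per OpenIFS task, as quoted in the thread


def max_oifs_tasks(physical_cores: int, ram_gb: float,
                   gb_per_task: float = GB_PER_TASK) -> int:
    """Return the largest task count that satisfies both limits."""
    by_cpu = max(physical_cores - 1, 0)   # leave one physical core free
    by_ram = int(ram_gb // gb_per_task)   # whole tasks that fit in RAM
    return min(by_cpu, by_ram)


if __name__ == "__main__":
    # The 4-core, 24GB example from the post above.
    print(max_oifs_tasks(4, 24))
```

For the 4-core, 24GB machine the CPU limit (3) is the binding one; on a machine with many cores and the same RAM, the memory limit would cap the count at 4 instead.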
Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
"On the CPDN web site, most of my completed tasks have received credit, but I have not received any credit on the Statistics tab in a very long time. BoincStats seems to know, but the Projects tab does not know either." My total credit is completely consistent across my BOINC Manager (both tabs), this web site (below my name to the left and on my account page), and at BOINCstats. My average credit - aka RAC - on the other hand has been falling, because average credit for the IFS tasks is not being calculated as it should be. |
Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
"double free or corruption (out)" is not caused by lack of free RAM. It's something else. |
Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
xii5ku wrote: "'double free or corruption (out)' is not caused by lack of free RAM." If not directly, then perhaps indirectly? Could it be that pushing the RAM limits is more likely to bring out these types of memory-related problems? |
Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556 |
"double free or corruption (...)" Like many other pages, this one explains what usually causes that error and what to do to avoid it: https://linuxhint.com/double-free-corruption-error/ Looks like the code of the scientific app needs to be revised to ensure correct pointer assignment. |
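To make the first cause mentioned earlier in the thread (freeing the same memory twice) a bit more concrete, here is a toy model in Python. It is only a bookkeeping sketch: `ToyHeap` is a made-up class, not glibc's allocator, whose real consistency checks work on heap metadata and are far more involved. It also does not model the second cause (something overwriting data next to the block).

```python
# Toy model only: ToyHeap is hypothetical and is NOT how glibc works
# internally. It just shows why freeing the same block twice is detectable.
class ToyHeap:
    """Tracks which block ids are currently allocated."""

    def __init__(self) -> None:
        self._next_id = 0
        self._live = set()

    def malloc(self) -> int:
        """Hand out a new block id and remember it as live."""
        block = self._next_id
        self._next_id += 1
        self._live.add(block)
        return block

    def free(self, block: int) -> None:
        """Release a block; complain if it is not currently live."""
        if block not in self._live:
            raise RuntimeError("double free or corruption")
        self._live.remove(block)


def demo() -> str:
    """Free one block twice and return the resulting error text."""
    heap = ToyHeap()
    p = heap.malloc()
    heap.free(p)          # first free is fine
    try:
        heap.free(p)      # second free of the same block is detected
    except RuntimeError as err:
        return str(err)
    return "no error"


if __name__ == "__main__":
    print(demo())
```

In a real C or Fortran program the second `free` usually comes from two pieces of code that both believe they own the same pointer, which is why such bugs tend to sit in rarely exercised paths.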
Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
AndreyOR wrote: "xii5ku wrote: 'double free or corruption (out) is not caused by lack of free RAM.' If not directly then perhaps indirectly? Could it be that pushing the RAM limits is more likely to bring out these types of memory related problems?" It's hard to tell but I believe that this is unlikely. Keep in mind: When there is a lack of free RAM but some processes on the system request new RAM allocations, what then follows is not that the allocations fail.¹ Instead, the OS first tries to make more RAM available by swapping less recently accessed pages out to swap space (at the price of the entire system becoming less and less responsive, possibly to a degree that users mistake the system for being frozen entirely). When all swap space is used up (or if there isn't any swap space attached in the first place), then the kernel proceeds to act on the out-of-memory situation by picking processes with large memory footprint and terminating them. (This is known as Linux' "OOM killer". That's as if SIGKILL was sent to the affected process, which the process cannot catch. Therefore the process doesn't have a chance anymore to exercise any –possibly buggy– code paths. The process simply goes away immediately.)² That said, the period during which swap space starts to be used and system responsiveness is degraded could uncover bugs or misbehaviour in programs with realtime functionality. An example of such functionality could be I/O watchdogs which consider an I/O operation to be failed if it doesn't succeed within a certain time frame. (In turn, the handling of such assumed failure could easily contain bugs, such as memory corruptions, because error handling code paths like this may be rarely exercised in testing.) I don't know if this class of "realtime functionality" is relevant to the OpenIFS application or the CPDN wrapper. 
________ ¹) A failed memory allocation would lead to various random program misbehaviour: The program might catch the failure but might not have a good strategy to back out of such a situation. Or the allocation error handler might contain a programming bug. Or the program may not check for failure of the allocation attempt and use the returned error pointer as if it was pointing to successfully allocated memory. In the latter case, the program would most likely crash with a segfault. But see next footnote. ²) Consequently, if a program attempts to allocate memory when the system doesn't have any available anymore, two things can happen: Either the allocation succeeds but the required system call takes a rather long time to complete. Or the process which performs the allocation attempt is terminated by the OOM killer. That is, on Linux, memory allocation requests by userspace processes never fail with a returned error pointer. AFAIK. |
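Since the post above singles out the period when swap starts being used, one way to see whether a Linux host is in that state is to read `/proc/meminfo`. The sketch below is an illustration only: the helper names `parse_meminfo` and `swap_used_kb` are made up, the `SAMPLE` values are invented for the example, and it assumes the usual `Key:   value kB` layout of that file.

```python
# Sketch only: helper names are hypothetical and SAMPLE contains made-up
# numbers. Assumes the usual "Key:   value kB" layout of /proc/meminfo.
SAMPLE = """MemTotal:       24576000 kB
MemAvailable:    6291456 kB
SwapTotal:      16384000 kB
SwapFree:       16000000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Turn meminfo-style text into a {field: value_in_kB} dictionary."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key:" prefix
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(info: dict) -> int:
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:   # Linux only
            text = fh.read()
    except OSError:
        text = SAMPLE                       # fall back to the made-up sample
    info = parse_meminfo(text)
    print("MemAvailable kB:", info.get("MemAvailable"))
    print("Swap used kB:", swap_used_kb(info))
```

A steadily growing "swap used" figure while OpenIFS tasks are running would be a hint that the host is in the degraded state described above, well before the OOM killer gets involved.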
Joined: 12 Apr 21 Posts: 317 Credit: 14,819,420 RAC: 19,777 |
"double free or corruption (out)" I just had 2 tasks fail with this error on a system that's been error free for the past ~3 weeks since I reduced the number of concurrent tasks. A very cursory look seems to show that it's a problem that's not that uncommon. I don't know if it's frequent enough that it might need to be addressed in the code if we're to keep the error rate under 5%. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
nairb wrote: "This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error" I'm aware of these errors and have been looking into them. You may also see an error message 'free(): ....' at the end as well. Looking at the failure stats for all the batches, it's responsible for ~10% of the total task failures we see. It tends to occur more on Ryzen chips (and possibly Xeons, but I've not completed the analysis), probably because of the larger shared cache. As has been mentioned, decreasing memory pressure tends to help -- it decreases the risk of the bad address picking up wrong data. So if it happens too often, try running 1 less CPDN task. That helped me with my Ryzen 5600G; it's rare I see this on my Intel machines. Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the boinc libraries responsible for zipping up the results files rather than the CPDN code, but it's early days so I can't be sure where the error is coming from. So it's annoying; we are looking into it, but it's hard to say when it'll be fixed. |
Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785 |
Yup. I agree it can be difficult to track down. I did work for years on call centre kit with over 1,000 concurrent users. We did test for 8000 concurrent jobs on the machines. With multiple layers of software it was a challenge to find the culprit with a memory leak/corruption problem. I always tried to get the application programming teams to "try" and give informative error messages........... not always seen as the most important issue. But a useful error message can save endless hours later!!! The machine I use for cpdn seems able to run any combination of projects without issues. 8 of anything seems ok, and they all seem to recover from a power cut..... Unlike some cpdn w/u. With 4 cpdn w/u running at once it uses very little swap space and usually shows about 5~6 gig of memory free. I know peak usage will vary. Anyway, I hope the bug is found, since it will save a lot of frustration for everyone. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the boinc libraries responsible for zipping up the results files rather than the CPDN code, but it's early days so I can't be sure where the error is coming from." nairb wrote: "Yup. I agree it can be difficult to track down. I did work for years on call centre kit with over 1,000 concurrent users. We did test for 8000 concurrent jobs on the machines. With multiple layers of software it was a challenge to find the culprit with a memory leak/corruption problem." I've now found & fixed all the memory leaks in our control code wrapper. A new version is about to be tested and will be used for upcoming batches. There are still leaks coming from the boinc_zip part of the boinc software which need further investigation - which I am not about to do right now. Anyway, better to test the new version and see if we still get corruption. nairb wrote: "I always tried to get the application programming teams to 'try' and give informative error messages........... not always seen as the most important issue. But a useful error message can save endless hours later!!!" Yep, I was always campaigning for 'user focussed error messages', not developer focussed ones! The problem with memory corruption, though, is that it often corrupts the stack, so error messages never make it out and the system never gets the chance to flush the buffers. nairb wrote: "The machine I use for cpdn seems able to run any combination of projects without issues. 8 of anything seems ok, and they all seem to recover from a power cut..... Unlike some cpdn w/u." Avoid swapping like the plague for OpenIFS tasks. It's a large memory model, and swapping will seriously affect the performance of the task (and your machine). Always make sure there's enough memory headroom. You should see fewer failed tasks in the next batches as various fixes have gone in. 
The long forecast for these batches threw up these errors we'd not seen before on the shorter forecasts run to date. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
A couple more failure cases, both from the same machine and with the same symptoms: tasks 22292258 and 22299451. Both failed with "Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT", and stderr ending: "18:03:00 STEP 2952 H=2952:00 +CPU= 24.459" (timings from the second of the tasks cited). I still have the local Event Log for the same task, which has: "27/01/2023 18:04:26 | climateprediction.net | Started upload of oifs_43r3_ps_0194_2000050100_123_969_12185838_1_r841928961_122.zip" The machine has a 6 core CPU, 64 GB RAM, 5 IFS tasks running, no other BOINC projects active. The timings don't suggest the machine or comms are under any current stress - I fetched hard when the upload failure occurred last Saturday, and I'm finishing off the work downloaded then. All finished work has been uploaded and reported, so new uploads are going through in order and in real time as they're created. Other tasks finished normally between these two. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,700,823 RAC: 9,977 |
It's happened again, to task 22301685. The messages are the same, and include one I missed earlier - it was present in both the previous failures.
Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
<message> Process still present 5 min after writing finish file; aborting </message>
11:41:25 STEP 2952 H=2952:00 +CPU= 24.374
The child process terminated with status: 0
Zipping up the final file: /hdd/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
11:44:49 (93477): called boinc_finish(0)
29/01/2023 11:42:51 | climateprediction.net | Started upload of oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
29/01/2023 11:43:09 | climateprediction.net | Finished upload of oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
29/01/2023 11:50:07 | climateprediction.net | Computation for task oifs_43r3_ps_0501_2004050100_123_973_12190145_1 finished
What could cause "the process" - I'm assuming the CPDN wrapper - not to close itself within 5 minutes? How can I find out - will it be logged anywhere? |
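As a sanity check on the timestamps in the post above, the gap between the `called boinc_finish(0)` line and the Event Log's "Computation for task ... finished" line can be worked out directly. This is a sketch under assumptions: the stderr line carries only a time, so the date is taken from the Event Log entries, both logs are assumed to use the same clock, and the constant and function names are made up for the example.

```python
# Sketch only: names are hypothetical. The stderr line has no date, so
# 29/01/2023 is assumed from the Event Log entries quoted above.
from datetime import datetime, timedelta

FINISH_CALLED = "29/01/2023 11:44:49"   # stderr: called boinc_finish(0)
TASK_FINISHED = "29/01/2023 11:50:07"   # Event Log: Computation ... finished
FMT = "%d/%m/%Y %H:%M:%S"


def elapsed(start: str, end: str, fmt: str = FMT) -> timedelta:
    """Return the time between two timestamps written in the same format."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


if __name__ == "__main__":
    gap = elapsed(FINISH_CALLED, TASK_FINISHED)
    print("gap after boinc_finish:", gap)
```

Under those assumptions the gap is a little over five minutes, which at least lines up with the "Process still present 5 min after writing finish file" message; it does not say what the process was doing during that time.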
©2024 cpdn.org