Thread 'w/u failed at the 89th zip file'

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 67915 - Posted: 19 Jan 2023, 22:34:00 UTC

This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error

"double free or corruption (out)"

Anybody had one of these? Just curious what it might mean??
Ta
Nairb
ID: 67915

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4537
Credit: 19,001,532
RAC: 21,726
Message 67919 - Posted: 20 Jan 2023, 6:17:24 UTC - in response to Message 67915.  

nairb wrote:
This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error

"double free or corruption (out)"

Anybody had one of these? Just curious what it might mean??
Ta
Nairb

I don't have time to search it out right now, but there is something about this in the OIFS discussion thread.
ID: 67919

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 67920 - Posted: 20 Jan 2023, 7:27:28 UTC - in response to Message 67915.  

nairb,
A couple of things to consider, in order of priority:
1) How many concurrent tasks are you running? With the amount of RAM that you have, running more than 3 at a time will likely lead to a high failure rate. Two may not be a bad idea if your PC is heavily used for other things.
2) Do you have "Leave non-GPU tasks in memory while suspended" enabled in Computing preferences? It's highly recommended, especially if the tasks often get interrupted for any reason like task swapping, BOINC/PC restarts.
ID: 67920

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 67925 - Posted: 20 Jan 2023, 20:44:45 UTC - in response to Message 67920.  

AndreyOR wrote:
2) Do you have "Leave non-GPU tasks in memory while suspended" enabled in Computing preferences? It's highly recommended, especially if the tasks often get interrupted for any reason like task swapping, BOINC/PC restarts.

Yes, it's ticked. It's a dedicated machine and I try to ensure that once a climate task starts it's not suspended by other work and runs to completion.

When I checked the machine today, the machine seemed to be running almost idle with 2 tasks using almost no cpu time.
It needed a hard reboot. I should have done a memory check but it looks to have come back to life, but has dumped 2 of the working w/u's with computation errors.

When it's cleared the running jobs I will run a memory checker just to be sure.

I do tend to load the thing with 4 climate jobs and 4 WCG jobs at once....... it's done OK so far. But maybe I have been lucky.
ID: 67925

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 67929 - Posted: 21 Jan 2023, 8:13:08 UTC - in response to Message 67925.  

nairb wrote:
Yes, it's ticked. It's a dedicated machine and I try to ensure that once a climate task starts it's not suspended by other work and runs to completion.

It looks like you have the 2nd point taken care of, which is good.

As for the first one... Looking at your history of OIFS tasks, the failure rate is ~26% (10/39), which is high. Glenn said that the goal he'd like to reach is under 5%. A high failure rate is bad for the project (no result for the scientists) and the user (wasted CPU time and power). I don't think I've done any WCG CPU work since they shut down for the move, so I don't know how demanding their tasks are, especially for RAM. Regardless, it'd seem reasonable to me to go down to 3 OIFS tasks, let things run like that for a good week, and see what the failure rate is for that time period. If it's above 5%, go down to 2 or reduce other high RAM usage on that PC. Keep trying until you find a workload that produces a failure rate under 5%.
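
For anyone wanting to track this on their own host, the arithmetic is simple enough to script. The following is only a sketch: the function name and the `TARGET_PERCENT` constant are made up for illustration, and the task counts are the ones quoted above (10 failures out of 39 tasks).

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the share of failed tasks as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# Goal mentioned in this thread (under 5%); the constant name is hypothetical.
TARGET_PERCENT = 5.0

rate = failure_rate(10, 39)
print(f"{rate:.1f}%")         # 25.6%
print(rate < TARGET_PERCENT)  # False
```

Re-running the same calculation after a week at a lower task count gives a like-for-like comparison, provided the task total is large enough to be meaningful.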

I believe I found the sweet spot for my 2 PCs but I still have to do some observation of one of them to make sure.
ID: 67929

gemini8
Joined: 4 Dec 15
Posts: 52
Credit: 2,477,851
RAC: 1,712
Message 67930 - Posted: 21 Jan 2023, 8:36:39 UTC

I second the notion of decreasing the memory load.
With 'just' 24 gigs of RAM I'd reduce CPDN to three tasks or drop WCG altogether for the moment.
- - - - - - - - - -
Greetings, Jens
ID: 67930

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67932 - Posted: 21 Jan 2023, 12:41:57 UTC
Last modified: 21 Jan 2023, 12:52:17 UTC

"double free or corruption (out)" means:
– Either the program attempted to free (i.e., to deallocate) a memory segment more than once.
– Or something illegally overwrote certain data right before the memory segment which was to be freed.

Could be a programming error. Or could be a secondary symptom of some earlier program failure.
(Edit: Can also be caused by hardware defects, e.g. overclocked RAM or CPU, but IIRC this failure mode has also been seen on Xeon hosts which seem unlikely to be operated in an unstable manner or with defects.)
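
To make the first case concrete, here is a toy model in Python. It is not how glibc works internally (the real allocator runs consistency checks on its own bookkeeping data rather than keeping a set of live blocks), and the class `ToyHeap` is invented purely for illustration. It only shows why releasing the same block twice is an error the allocator can sometimes detect.

```python
class ToyHeap:
    """Toy allocator that only tracks which handles are currently live."""

    def __init__(self) -> None:
        self._live = set()
        self._next_handle = 0

    def alloc(self) -> int:
        """Hand out a fresh handle and mark it as live."""
        handle = self._next_handle
        self._next_handle += 1
        self._live.add(handle)
        return handle

    def free(self, handle: int) -> None:
        """Release a handle; releasing it a second time is reported as an error."""
        if handle not in self._live:
            raise RuntimeError("double free or corruption")
        self._live.remove(handle)


heap = ToyHeap()
block = heap.alloc()
heap.free(block)      # first free is fine
try:
    heap.free(block)  # second free of the same block
except RuntimeError as err:
    print(err)        # double free or corruption
```

In a real C or Fortran program the second case (corrupted bookkeeping data) can trip the same check even though no block was actually freed twice, which is why the message names both possibilities.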

nairb wrote:
Anybody had one of these?
There are several more reports of this on this message board, and occurrences in the stderr.txt of several failed tasks from users who mentioned that they had failures.
ID: 67932

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67933 - Posted: 21 Jan 2023, 14:30:13 UTC - in response to Message 67929.  

AndreyOR wrote:
As for the first one... Looking at your history of OIFS tasks, the failure rate is ~26% (10/39), which is high. Glenn said that the goal he'd like to reach is under 5%. A high failure rate is bad for the project (no result for the scientists) and the user (wasted CPU time and power). I don't think I've done any WCG CPU work since they shut down for the move, so I don't know how demanding their tasks are, especially for RAM. Regardless, it'd seem reasonable to me to go down to 3 OIFS tasks, let things run like that for a good week, and see what the failure rate is for that time period. If it's above 5%, go down to 2 or reduce other high RAM usage on that PC. Keep trying until you find a workload that produces a failure rate under 5%.


I have done WCG CPU work since their shutdown. Right now I am running 5 OIFS tasks, 4 WCG CPU tasks, 1 Einstein task, and 2 MilkyWay tasks on my machine and that is not giving me any errors.

On the CPDN web site, most of my completed tasks have received credit, but I have not seen any credit on the Statistics tab in a very long time. BoincStats seems to know, but the Projects tab does not know either.

My machine is pretty much like this:

Computer 1511241

Total credit 	6,749,805
Average credit 	50.54
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	62.4 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	475.26 GB
Measured floating point speed 	6.05 billion ops/sec
Measured integer speed 	24.32 billion ops/sec
Average upload rate 	4801.6 KB/sec
Average download rate 	6929.83 KB/sec
Average turnaround time 	2.68 days

ID: 67933

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,330,034
RAC: 10,258
Message 67934 - Posted: 21 Jan 2023, 14:48:04 UTC - in response to Message 67915.  

I've had a couple of these errors on an Ubuntu VM. The advice from Glenn was to allow 5 GB per OpenIFS task and to run at most n-1 OpenIFS tasks concurrently, where n is the number of physical CPU cores. To avoid any unexpected compatibility issues, I also stopped running any other projects.

With a 4-core i7 CPU and 24 GB of memory, that would be 3 concurrent OpenIFS tasks, leaving memory headroom and one physical CPU core for whatever else the computer system wants to run.
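
The rule of thumb above can be written down as a small calculation. This is just a sketch of the advice as quoted here (5 GB per task, at most n-1 physical cores); the function name and parameters are hypothetical, and it ignores memory used by other projects or by the OS.

```python
def max_oifs_tasks(ram_gb: float, physical_cores: int, gb_per_task: float = 5.0) -> int:
    """Largest task count that satisfies both the RAM and the core limit."""
    by_memory = int(ram_gb // gb_per_task)  # how many 5 GB slots fit in RAM
    by_cores = physical_cores - 1           # leave one physical core free
    return max(0, min(by_memory, by_cores))


print(max_oifs_tasks(ram_gb=24, physical_cores=4))  # 3
```

For the 4-core, 24 GB machine the core limit (3) is the binding one; the memory limit alone would allow 4.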

nairb wrote:
This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error

"double free or corruption (out)"

Anybody had one of these? Just curious what it might mean??
Ta
Nairb

ID: 67934

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 67935 - Posted: 21 Jan 2023, 16:05:54 UTC - in response to Message 67933.  

Jean-David Beyer wrote:
On the CPDN web site, most of my completed tasks have received credit, but I have not seen any credit on the Statistics tab in a very long time. BoincStats seems to know, but the Projects tab does not know either.
My total credit is completely consistent across my BOINC Manager (both tabs), this web site (below my name to the left and on my account page), and at BOINCstats.

My average credit - aka RAC - on the other hand has been falling, because average credit for the IFS tasks is not being calculated as it should be.
ID: 67935

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67939 - Posted: 21 Jan 2023, 18:11:35 UTC

"double free or corruption (out)" is not caused by lack of free RAM.
It's something else.
ID: 67939

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 67958 - Posted: 22 Jan 2023, 8:15:37 UTC - in response to Message 67939.

xii5ku wrote:
"double free or corruption (out)" is not caused by lack of free RAM.
It's something else.

If not directly then perhaps indirectly? Could it be that pushing the RAM limits is more likely to bring out these types of memory related problems?
ID: 67958

computezrmle
Joined: 9 Mar 22
Posts: 30
Credit: 1,065,239
RAC: 556
Message 67962 - Posted: 22 Jan 2023, 9:07:39 UTC

double free or corruption (...)


Like many other pages, this one explains what usually causes that error and what to do to avoid it:
https://linuxhint.com/double-free-corruption-error/

Looks like the code of the scientific app needs to be revised to ensure correct pointer assignment.
ID: 67962

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67963 - Posted: 22 Jan 2023, 9:31:55 UTC - in response to Message 67958.  
Last modified: 22 Jan 2023, 9:36:37 UTC

AndreyOR wrote:
xii5ku wrote:
"double free or corruption (out)" is not caused by lack of free RAM.
It's something else.
If not directly then perhaps indirectly? Could it be that pushing the RAM limits is more likely to bring out these types of memory related problems?
It's hard to tell, but I believe that this is unlikely. Keep in mind: when there is a lack of free RAM and some processes on the system request new RAM allocations, what follows is not that the allocations fail.¹ Instead, the OS first tries to make more RAM available by swapping less recently accessed pages out to swap space (at the price of the entire system becoming less and less responsive, possibly to a degree that users mistake the system for being frozen entirely). When all swap space is used up (or if there isn't any swap space attached in the first place), the kernel acts on the out-of-memory situation by picking processes with a large memory footprint and terminating them. (This is known as Linux's "OOM killer". It is as if SIGKILL were sent to the affected process, which the process cannot catch. Therefore the process no longer has a chance to exercise any –possibly buggy– code paths. The process simply goes away immediately.)²

That said, the period during which swap space starts to be used and system responsiveness degrades could uncover bugs or misbehaviour in programs with realtime functionality. An example of such functionality could be I/O watchdogs which consider an I/O operation to have failed if it doesn't succeed within a certain time frame. (In turn, the handling of such an assumed failure could easily contain bugs, such as memory corruptions, because error handling code paths like this may be rarely exercised in testing.) I don't know if this class of "realtime functionality" is relevant to the OpenIFS application or the CPDN wrapper.

________
¹) A failed memory allocation could lead to various kinds of program misbehaviour: The program might catch the failure but might not have a good strategy to back out of such a situation. Or the allocation error handler might contain a programming bug. Or the program may not check for failure of the allocation attempt and use the returned null pointer as if it were pointing to successfully allocated memory. In the latter case, the program would most likely crash with a segfault. But see next footnote.

²) Consequently, if a program attempts to allocate memory when the system doesn't have any available anymore, two things can happen: Either the allocation succeeds but the required system call takes a rather long time to complete. Or the process which performs the allocation attempt is terminated by the OOM killer. That is, on Linux with its default overcommit behaviour, memory allocation requests by userspace processes practically never fail with a returned error pointer.

AFAIK.
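
To see whether a host is actually in the swapping phase described above, one option on Linux is to look at the `MemAvailable`, `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The sketch below parses text in that format; the helper names are made up, and the sample numbers are invented for illustration rather than taken from any machine in this thread.

```python
def parse_meminfo(text: str) -> dict:
    """Parse '/proc/meminfo'-style lines into a {field: value_in_kB} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Name: value' shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(info: dict) -> int:
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


# Invented sample in the /proc/meminfo format (not real measurements).
SAMPLE = """\
MemTotal:       24576000 kB
MemAvailable:    5500000 kB
SwapTotal:       8388608 kB
SwapFree:        8300000 kB
"""

info = parse_meminfo(SAMPLE)
print(swap_used_kb(info))  # 88608
```

On a real Linux host the same function could be fed the contents of `/proc/meminfo`; a steadily growing swap figure while OpenIFS tasks run would suggest too many tasks are running at once.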
ID: 67963

AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,816,935
RAC: 19,934
Message 67967 - Posted: 22 Jan 2023, 10:25:35 UTC
Last modified: 22 Jan 2023, 10:27:13 UTC

double free or corruption (out)

I just had 2 tasks fail with this error on a system that's been error free for the past ~3 weeks since I reduced the number of concurrent tasks. A very cursory look seems to show that it's a problem that's not that uncommon. I don't know if it's frequent enough that it might need to be addressed in the code if we're to keep the error rate under 5%.
ID: 67967

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67988 - Posted: 23 Jan 2023, 15:02:27 UTC - in response to Message 67915.  

I'm aware of these errors and have been looking into them. You may also see an error message 'free(): ....' at the end as well. Looking at the failure stats for all the batches, it's responsible for ~10% of the total task failures we see. It tends to occur more on Ryzen chips (and possibly Xeons, but I've not completed the analysis), probably because of the larger shared cache.

As has been mentioned, decreasing memory pressure tends to help -- it decreases the risk of the bad address picking up wrong data. So if it happens too often, try running one fewer CPDN task. That helped me with my Ryzen 5600G; it's rare that I see this on my Intel machines.

Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the BOINC libraries responsible for zipping up the results files rather than in the CPDN code, but it's early days so I can't be sure where the error is coming from. So it's annoying; we are looking into it, but it's hard to say when it'll be fixed.

nairb wrote:
This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error

"double free or corruption (out)"

Anybody had one of these? Just curious what it might mean??
Ta
Nairb
ID: 67988

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 68027 - Posted: 25 Jan 2023, 1:56:03 UTC - in response to Message 67988.  


Glenn Carver wrote:
Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the boinc libraries responsible for zipping up the results files rather than the CPDN code, but it's early days so I can't be sure where the error is coming from.


Yup. I agree it can be difficult to track down. I worked for years on call centre kit with over 1000 concurrent users. We tested for 8000 concurrent jobs on the machines. With multiple layers of software it was a challenge to find the culprit behind a memory leak/corruption problem.

I always tried to get the application programming teams to "try" and give informative error messages........... not always seen as the most important issue. But a useful error message can save endless hours later!!!

The machine I use for CPDN seems able to run any combination of projects without issues. 8 of anything seems OK, and they all seem to recover from a power cut..... Unlike some CPDN w/u.
With 4 CPDN w/u running at once it uses very little swap space and usually shows about 5~6 gig of memory free. I know peak usage will vary.
Anyway, I hope the bug is found, since it will save a lot of frustration for everyone.
ID: 68027

Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68049 - Posted: 25 Jan 2023, 19:47:32 UTC - in response to Message 68027.  

Glenn Carver wrote:
Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the boinc libraries responsible for zipping up the results files rather than the CPDN code, but it's early days so I can't be sure where the error is coming from.
nairb wrote:
Yup. I agree it can be difficult to track down. I did work for years on call centre kit with over a 1000 concurrent users. We did test for 8000 concurrent jobs on the machines. With multiple layers of software it was a challenge to find the culprit with a memory leak/corruption problem.

I've now found & fixed all the memory leaks in our control code wrapper. A new version is about to be tested and will be used for upcoming batches. There are still leaks coming from the boinc_zip part of the BOINC software which need further investigation, which I am not about to do right now. Anyway, it's better to test the new version and see if we still get corruption.

nairb wrote:
I always tried to get the application programming teams to "try" and give informative error messages........... not always seen as the most important issue. But a useful error message can save endless hours later!!!.
Yep, I was always campaigning for 'user focussed error messages', not developer focussed ones! The problem with memory corruption, though, is that it often corrupts the stack, so error messages never make it out and the system gets no chance to flush the buffers.

nairb wrote:
The machine I use for cpdn seems able to run any combination of projects without issues. 8 of anything seems ok, and they all seem to recover from a power cut..... Unlike some cpdn w/u.
With 4 cpdn w/u running at once it uses very little swap space and usually shows about 5~6 gig of memory free. I know peak usage will vary.
Anyway, I hope the bug is found, since it will save a lot of frustration for everyone
Avoid swapping like the plague for OpenIFS tasks. It's a large-memory model, and swapping will seriously affect the performance of the task (and your machine). Always make sure there's enough memory headroom.

We should see fewer failed tasks in the next batches, as various fixes have gone in. The long forecast for these batches threw up errors we'd not seen before on the shorter forecasts run to date.
ID: 68049

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 68085 - Posted: 27 Jan 2023, 18:45:42 UTC

A couple more failure cases, both from the same machine and the same symptoms.

Tasks 22292258 and 22299451. Both failed with "Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT", and stderr ending:

18:03:00 STEP 2952 H=2952:00 +CPU= 24.459
The child process terminated with status: 0
Zipping up the final file: /hdd/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0194_2000050100_123_969_12185838_1_r841928961_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
18:06:24 (76323): called boinc_finish(0)
(timings from the second of the tasks cited)

I still have the local Event Log for the same task, which has:

27/01/2023 18:04:26 | climateprediction.net | Started upload of oifs_43r3_ps_0194_2000050100_123_969_12185838_1_r841928961_122.zip
27/01/2023 18:04:58 | climateprediction.net | Finished upload of oifs_43r3_ps_0194_2000050100_123_969_12185838_1_r841928961_122.zip
27/01/2023 18:11:42 | climateprediction.net | Computation for task oifs_43r3_ps_0194_2000050100_123_969_12185838_1 finished

The machine has a 6-core CPU, 64 GB RAM, and 5 IFS tasks running, with no other BOINC projects active. The timings don't suggest the machine or comms are under any current stress - I fetched hard when the upload failure occurred last Saturday, and I'm finishing off the work downloaded then. All finished work has been uploaded and reported, so new uploads are going through in order and in real time as they're created. Other tasks finished normally between these two.
ID: 68085

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,700,823
RAC: 9,977
Message 68103 - Posted: 29 Jan 2023, 12:40:01 UTC

It's happened again, to task 22301685.

The messages are the same, and include one I missed earlier - it was present in both the previous failures.

Exit status	194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
<message>
Process still present 5 min after writing finish file; aborting
</message>
  11:41:25 STEP 2952 H=2952:00 +CPU= 24.374
The child process terminated with status: 0
Zipping up the final file: /hdd/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
11:44:49 (93477): called boinc_finish(0)

29/01/2023 11:42:51 | climateprediction.net | Started upload of oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
29/01/2023 11:43:09 | climateprediction.net | Finished upload of oifs_43r3_ps_0501_2004050100_123_973_12190145_1_r609007247_122.zip
29/01/2023 11:50:07 | climateprediction.net | Computation for task oifs_43r3_ps_0501_2004050100_123_973_12190145_1 finished

What could cause "the process" - I'm assuming the CPDN wrapper - not to close itself within 5 minutes? How can I find out - will it be logged anywhere?
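
As a quick sanity check on the timings quoted above, the gap between the `called boinc_finish(0)` line in stderr and the "Computation ... finished" line in the Event Log can be worked out as below. This assumes both timestamps come from the same clock on the same day, which the logs themselves do not confirm; the helper name is made up.

```python
from datetime import datetime


def seconds_between(start_hms: str, end_hms: str) -> int:
    """Seconds from start to end, both given as HH:MM:SS on the same day."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_hms, fmt)
    end = datetime.strptime(end_hms, fmt)
    return int((end - start).total_seconds())


# Timestamps copied from the logs quoted in this post.
gap = seconds_between("11:44:49", "11:50:07")
print(gap)        # 318
print(gap > 300)  # True
```

A gap just over 300 seconds fits the "Process still present 5 min after writing finish file; aborting" message, and the earlier task (18:06:24 to 18:11:42) gives the same 318 seconds. It does not say why the process lingered.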
ID: 68103

©2024 cpdn.org