Message boards : Number crunching : w/u failed at the 89th zip file
Joined: 29 Nov 17 · Posts: 81 · Credit: 13,191,773 · RAC: 88,330
Well, the BOINC code comments say:

    // If we already found a finish file, abort the app;
    // it must be hung somewhere in boinc_finish();

I do find this comment thought-provoking:

    // process is still there 5 min after it wrote finish file.
    // abort the job
    // Note: actually we should treat it as successful.
    // But this would be tricky.
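To make the logic behind those comments concrete, here is a minimal C++ sketch of the check they describe: the client notices the finish file, allows a grace period, and aborts the task if the science app's process is still alive afterwards. All names here are hypothetical; the real BOINC client source is structured differently.

    #include <cstdio>

    // Illustrative sketch only -- names and structure are hypothetical,
    // not the actual BOINC client code.
    const double FINISH_FILE_TIMEOUT = 300;  // seconds; the 5-minute grace period

    struct Task {
        double finish_file_time = 0;  // when the finish file was first seen
        bool still_running = true;    // stand-in for a real process check
        void abort_task(const char* msg) {
            std::printf("%s\n", msg);
            still_running = false;
        }
    };

    // Imagined per-poll check, run by the client after a finish file appears.
    void poll_finish_file(Task& t, double now) {
        if (!t.still_running) return;  // normal case: the app exited cleanly
        if (now - t.finish_file_time > FINISH_FILE_TIMEOUT) {
            // Process is still there 5 min after it wrote the finish file:
            // abort the job. (As the quoted comment notes, it should arguably
            // count as a success instead, but that would be tricky.)
            t.abort_task("Process still present 5 min after writing finish file; aborting");
        }
    }

    int main() {
        Task t;
        poll_finish_file(t, 301);  // past the grace period -> task is aborted
    }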
Joined: 1 Jan 07 · Posts: 1058 · Credit: 36,585,599 · RAC: 15,925
That line was added, and the timeout increased, around 4 years ago:
https://github.com/BOINC/boinc/commit/db4c3d0c22d772f77d6d65e6adf9f23280530a7f
("client: increase finish-file timeout")

'ensures' --> 'assures'? 'The longer timeout makes this moot'. Or not, as the case may be.

With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too.
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,193,804 · RAC: 2,852
> 'The longer timeout makes this moot'. Or not, as the case may be. With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too.

My machine has a 512 GByte solid-state drive, but other than the BOINC client software and the swap space, all the BOINC stuff is on a 5400 rpm SATA spinning drive. Yet I do next to no paging.

I have no CPDN tasks running at the moment because there are none. I started getting swap space usage when I had about 20 completed tasks that were unable to upload their "trickles" for about a week. I have no idea what was on that swap space. There was certainly no thrashing of running BOINC tasks. This is my current RAM usage:

    $ free -hw
                  total   used   free  shared  buffers  cache  available
    Mem:           62Gi  4.0Gi  1.0Gi    85Mi    162Mi   57Gi       57Gi
    Swap:          15Gi  3.0Mi   15Gi
Joined: 3 Sep 04 · Posts: 105 · Credit: 5,646,090 · RAC: 102,785
With 1 CPDN task running by itself, it failed after zip no. 83 with:

    13:30:43 STEP 2039 H=2039:00 +CPU= 18.156
    double free or corruption (out)

I will be glad to see this problem solved........ The machine will have been running with endless free memory and several idle threads.
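For anyone unfamiliar with that last line: "double free or corruption" is glibc's heap-consistency check aborting the process. A minimal C++ illustration of the bug class (purely illustrative -- this is not OpenIFS code, and the exact message wording varies with glibc version and heap state):

    #include <cstdlib>

    int main() {
        int* p = static_cast<int*>(std::malloc(sizeof(int)));
        std::free(p);
        std::free(p);  // freeing the same pointer twice corrupts the heap
                       // bookkeeping; glibc detects it and aborts with a
                       // "double free or corruption"-style message
        return 0;
    }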
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,193,804 · RAC: 2,852
> With 1 CPDN task running by itself, it failed after zip no. 83 with

It is really puzzling to me that I have such good luck with these, and others have bad. I really wonder what the difference is.

    CPU type                   GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
    Number of processors       16
    Operating System           Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
    BOINC version              7.20.2
    Memory                     62.4 GB
    Cache                      16896 KB
    Swap space                 15.62 GB
    Total disk space           488.04 GB
    Free disk space            479.52 GB

    OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
    Number of tasks completed  208
    Max tasks per day          212
    Number of tasks today      0
    Consecutive valid tasks    208
    Average processing rate    28.33 GFLOPS
    Average turnaround time    3.74 days
Joined: 15 May 09 · Posts: 4529 · Credit: 18,663,251 · RAC: 14,512
My current error-free run is now up to 18. I think all of the errors have been while I was running four or more tasks at once. I am now sticking to a maximum of two, because even that is slightly more than my ADSL can cope with.
Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077
Jean-David Beyer wrote:
> It is really puzzling to me that I have such good luck with these, and others have bad.

While it certainly is partly a matter of bad luck vs. good luck, it is also partly a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter.

(I for one am one of the users who complete comparably few tasks, because of my upload bandwidth limit, which means I don't run a lot of tasks while the upload server is up, and am down to running only 1 "pilot" task per computer while the upload server is down. Which it is most of the time. Plus, by now I practically never suspend a task to disk, only to RAM. Consequently, I have had very few errors so far.)
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,193,804 · RAC: 2,852
> It is really puzzling to me that I have such good luck with these, and others have bad.

I do not know about statistics -- and I have a BA in mathematics and took both a course in statistics and another in probability. Is 208 consecutive tasks out of 208 total tasks a few or a lot? In any case, a 100% success rate seems pretty good. How many need I run before I get a failure?

I seem to run them faster than the server delivers them, even when periods occur when the upload server will not accept the "trickles." And in many cases, two prior users have attempted the same work unit and failed.
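As a back-of-the-envelope check on "a few or a lot" (my own arithmetic, not project data): if every task failed independently with probability p, the chance of a 208-for-208 streak is (1 - p)^208. A tiny C++ snippet to tabulate this for some assumed failure rates:

    #include <cmath>
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const int n = 208;  // consecutive successes observed
        for (double p : {0.01, 0.02, 0.05}) {  // assumed per-task failure rates
            // probability of n independent successes in a row
            std::printf("p = %.2f -> P(%d straight) = %.5f\n", p, n, std::pow(1.0 - p, n));
        }
    }

Even a 1% per-task failure rate would make such a streak only about a 1-in-8 event, and a 5% rate about 1 in 43,000, so a long clean run does suggest this host or its workload genuinely differs from the error-prone ones.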
Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077
Jean-David Beyer wrote:
> Is 208 consecutive tasks out of 208 total tasks a few or a lot?

It's all relative. Right now I am seeing this on your single Linux host:

    oifs_43r3_ps results: 244 total, 243 valid, 1 in progress

This is mine across 2 (earlier partly 3) hosts:

    oifs_43r3_ps results: 1493 total, 954 valid, 484 in progress (would have been done by now if not for the permanent upload server absence), 55 error

My last errors were mostly from November, when I shut down and resumed one of my hosts. Then a few errors from Dec 1, one from Dec 3, and no errors since. But I have successfully avoided suspending tasks to disk ever since November, with the exception of 1 deliberate test which AFAICT didn't fail. (It might still fail, if it is among the pending uploads.)

My upload bandwidth allows me to return 48 results per day. (This is rather little relative to the CPUs, RAM, and disk space which I could spare.) This means my 954 valid tasks translate to merely 20 days of production; the rest was server downtime. My best CPUs are 32-core CPUs, of which one alone could produce slightly more than 48 results per day. Some folks have even larger CPUs, or similarly large ones with a higher power budget than mine.

If we go by credit of the last week or last month, my 48 results/day during the brief times when the upload server is functioning put me above the average. But a few big producers are missing from these 3rd-party stats because they didn't enable statistics export. E.g., based on last week's credit of my team, 1.3 M credit was given to one or more users on my team without stats export, compared to my 400 k or your 80 k last week.
Joined: 3 Sep 04 · Posts: 105 · Credit: 5,646,090 · RAC: 102,785
An old friend is back..... "double free or corruption (out)". I have been missing these.
Joined: 3 Sep 04 · Posts: 105 · Credit: 5,646,090 · RAC: 102,785
At least this w/u ran to the end before aborting with "Process still present 5 min after writing finish file; aborting". No other error messages. Only 50% success so far with the latest bunch.
Joined: 29 Oct 17 · Posts: 1044 · Credit: 16,196,312 · RAC: 12,647
> At least this w/u ran to the end before aborting with

That error message points to an issue in the BOINC client. The task has finished and told the client, but then it gets stuck somewhere in the client code. CPDN isn't the only project to see this behaviour, but I've not seen any good explanation for why on the forums.
---
CPDN Visiting Scientist
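For readers who haven't seen the handshake: on the app side, boinc_finish() writes a small finish file in the slot directory and then exits; the client treats the file as "done" and expects the process to disappear promptly. A simplified C++ sketch of that app-side sequence (file name and details are illustrative; see the BOINC API source for the real implementation):

    #include <cstdlib>
    #include <fstream>

    // Simplified, hypothetical sketch of the app-side finish handshake.
    int boinc_finish_sketch(int status) {
        {
            std::ofstream f("boinc_finish_called");  // tell the client we are done
            f << status << "\n";
        }
        // If anything after this point hangs (flushing output, detaching
        // shared memory, library exit handlers), the process lingers, and
        // after 5 minutes the client aborts the task as an error even
        // though the science run itself completed.
        std::exit(status);
    }

    int main() { return boinc_finish_sketch(0); }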
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,193,804 · RAC: 2,852
> Only 50% success so far with the latest bunch.

Same here, but the failures all came first. They all had very short execution times.

    Task      Work unit  Sent                       Returned                   Status                 Run time (s)  CPU time (s)  Credit    Application
    22317174  12214156   27 Feb 2023, 2:24:01 UTC   27 Feb 2023, 17:23:19 UTC  Completed                 51,028.30     50,403.60      0.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
    22316084  12214703   24 Feb 2023, 22:24:03 UTC  25 Feb 2023, 13:43:22 UTC  Completed                 53,538.60     52,734.92  2,353.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
    22314976  12213630   24 Feb 2023, 12:25:29 UTC  25 Feb 2023, 3:03:19 UTC   Completed                 52,615.79     51,784.63  2,353.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
    22314676  12213385   22 Feb 2023, 6:23:59 UTC   22 Feb 2023, 7:24:41 UTC   Error while computing         66.16          1.15       ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
    22314647  12213316   22 Feb 2023, 3:24:44 UTC   22 Feb 2023, 3:49:31 UTC   Error while computing         66.61          1.28       ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
    22314608  12213345   22 Feb 2023, 0:25:23 UTC   22 Feb 2023, 1:23:20 UTC   Error while computing         66.38          1.15       ---  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
Joined: 29 Oct 17 · Posts: 1044 · Credit: 16,196,312 · RAC: 12,647
> Only 50% success so far with the latest bunch.
> Same here, but the failures all came first. They all had very short execution times.

I would expect that. The perturbations are most likely to generate instability soon after the model gets started. Once the model has balanced its mass & wind fields it will run on OK.

The baroclinic life-cycle experiment was different. In that setup, it started from a very simple atmospheric state which was then perturbed to generate large atmospheric 'storms', some of which became too strong for the model to resolve with the timestep length it was using. So for those batches, the model would tend to fail nearer the end of the run.
---
CPDN Visiting Scientist
Joined: 1 Jan 07 · Posts: 1058 · Credit: 36,585,599 · RAC: 15,925
In this case, it's much easier than that. The older tasks, like WU 12213345 (issued on 22 Feb) are resends from the failed batch 992, which was withdrawn because of a missing data file in the package. The newer tasks, like WU 12213630 (issued on 24 Feb) are from the corrected replacement batch issued on that day. |