Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 67807 - Posted: 17 Jan 2023, 16:16:06 UTC

As threatened, here's a look at the first minute or so of six tasks starting at once on a 64 GB machine. I don't think I've yet seen a working set size above 4.2 GB per task on this measure, but the log is still running and I'll scan through it later.

@ Glenn, how long would you expect it to take to reach the first peak memory use?

Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 1192.73MB, smoothed 596.37MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.900, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 1190.67MB, smoothed 595.33MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.840, kernel CPU 0.720
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 1203.55MB, smoothed 601.78MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.930, kernel CPU 0.680
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 1184.74MB, smoothed 592.37MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.870, kernel CPU 0.630
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 1188.60MB, smoothed 594.30MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.870, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 1196.59MB, smoothed 598.30MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.900, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] BOINC totals: WS 7156.88MB, smoothed 3578.44MB, swap 9979.33MB, 0.00 page faults/sec
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] All others: WS 2701.28MB, swap 258864.39MB, user 64.660s, kernel 33.670s
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] non-BOINC CPU usage: 0.78%
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 2656.43MB, smoothed 1626.40MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.200, kernel CPU 1.300
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 2656.42MB, smoothed 1625.88MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.130, kernel CPU 1.320
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2656.42MB, smoothed 1629.10MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.180, kernel CPU 1.370
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 2656.42MB, smoothed 1624.40MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.040, kernel CPU 1.240
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2656.43MB, smoothed 1625.36MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.050, kernel CPU 1.360
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2656.43MB, smoothed 1627.36MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.180, kernel CPU 1.310
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] BOINC totals: WS 15938.54MB, smoothed 9758.49MB, swap 19371.55MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] All others: WS 2701.54MB, swap 258864.39MB, user 64.900s, kernel 33.810s
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] non-BOINC CPU usage: 0.63%
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 2750.01MB, smoothed 2188.20MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.690, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 2753.11MB, smoothed 2189.49MB, swap 3236.08MB, 0.00 page faults/sec, user CPU 23.730, kernel CPU 1.730
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2755.95MB, smoothed 2192.52MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.700, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 2746.14MB, smoothed 2185.27MB, swap 3236.08MB, 0.00 page faults/sec, user CPU 23.590, kernel CPU 1.640
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2755.17MB, smoothed 2190.27MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.550, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2761.35MB, smoothed 2194.36MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.640, kernel CPU 1.740
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] BOINC totals: WS 16521.73MB, smoothed 13140.11MB, swap 19416.44MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] All others: WS 2700.32MB, swap 258862.36MB, user 65.080s, kernel 33.910s
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] non-BOINC CPU usage: 0.47%
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 3414.84MB, smoothed 2801.52MB, swap 3988.23MB, 0.00 page faults/sec, user CPU 32.870, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 3361.16MB, smoothed 2775.33MB, swap 3988.24MB, 0.00 page faults/sec, user CPU 33.040, kernel CPU 2.370
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2682.67MB, smoothed 2437.59MB, swap 3104.65MB, 0.00 page faults/sec, user CPU 32.870, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 3409.42MB, smoothed 2797.34MB, swap 3988.24MB, 0.00 page faults/sec, user CPU 32.810, kernel CPU 2.350
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2683.42MB, smoothed 2436.84MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 32.720, kernel CPU 2.560
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2689.60MB, smoothed 2441.98MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 32.760, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] BOINC totals: WS 18241.10MB, smoothed 15690.61MB, swap 21541.51MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] All others: WS 2700.34MB, swap 258862.32MB, user 65.250s, kernel 34.270s
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] non-BOINC CPU usage: 0.88%
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 1636.18MB, smoothed 2218.85MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.490, kernel CPU 2.820
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 1640.56MB, smoothed 2207.94MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.730, kernel CPU 2.690
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 1635.78MB, smoothed 2036.69MB, swap 1865.15MB, 0.00 page faults/sec, user CPU 42.630, kernel CPU 2.780
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 1635.91MB, smoothed 2216.63MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.530, kernel CPU 2.640
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 1643.66MB, smoothed 2040.25MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.400, kernel CPU 2.840
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 1741.96MB, smoothed 2091.97MB, swap 1953.28MB, 0.00 page faults/sec, user CPU 42.480, kernel CPU 2.790
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] BOINC totals: WS 9934.05MB, smoothed 12812.33MB, swap 11303.59MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] All others: WS 2700.48MB, swap 258862.32MB, user 65.410s, kernel 34.650s
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] non-BOINC CPU usage: 0.89%
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 4268.91MB, smoothed 3243.88MB, swap 4859.94MB, 0.00 page faults/sec, user CPU 51.590, kernel CPU 3.680
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 4272.00MB, smoothed 3239.97MB, swap 4859.95MB, 0.00 page faults/sec, user CPU 51.910, kernel CPU 3.460
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 4274.55MB, smoothed 3155.62MB, swap 4860.07MB, 0.00 page faults/sec, user CPU 51.870, kernel CPU 3.550
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 4273.29MB, smoothed 3244.96MB, swap 4859.95MB, 0.00 page faults/sec, user CPU 51.720, kernel CPU 3.460
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 4290.31MB, smoothed 3165.28MB, swap 4859.94MB, 0.00 page faults/sec, user CPU 51.430, kernel CPU 3.740
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 3569.75MB, smoothed 2830.86MB, swap 4107.27MB, 0.00 page faults/sec, user CPU 51.700, kernel CPU 3.570
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] BOINC totals: WS 24948.80MB, smoothed 18880.56MB, swap 28407.11MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] All others: WS 2700.48MB, swap 258862.32MB, user 65.550s, kernel 34.740s
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] non-BOINC CPU usage: 0.38%
Tue 17 Jan 2023 16:00:49 GMT |  | [mem_usage] enforce: available RAM 57819.62MB swap 1536.00MB
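(For anyone wanting to collect the same trace: these lines come from the BOINC client's mem_usage_debug log flag. A minimal cc_config.xml sketch to enable it, after which you re-read the config files or restart the client:)

<cc_config>
    <log_flags>
        <mem_usage_debug>1</mem_usage_debug>
    </log_flags>
</cc_config>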
ID: 67807
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67817 - Posted: 17 Jan 2023, 20:21:39 UTC - in response to Message 67807.  

I don't think I've yet seen a working set size above 4.2 GB per task on this measure, but the log is still running and I'll scan through it later.


If I just look at the top command results (top for me refreshes every 10 seconds), I see 4.6 GB on the largest task quite frequently. This is the RES column.
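(For reference, a non-interactive way to capture the same numbers is ps, which reports RSS in kB; a sketch:)

# Top memory users by resident set size
ps -eo pid,rss,comm --sort=-rss | head -n 10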
ID: 67817
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 67827 - Posted: 18 Jan 2023, 8:11:27 UTC

Another possible memory-related issue: one of the six tasks I started in yesterday's simultaneous stress test (22270385) failed in the very last second.

Four of the tasks finished within a five-second interval. All the others ended normally, but this one got a "process exited with code 9 (0x9, -247)". The full stderr is present, but the last two uploads were caught in this morning's apparent upload failure.
ID: 67827
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,622,177
RAC: 59,768
Message 67859 - Posted: 18 Jan 2023, 18:41:19 UTC - in response to Message 67807.  

The working set size reported by the BOINC client is smoothed, and on my system a measurement is taken every 10 seconds. This systematically under-estimates the peak memory usage, which is what matters if we want to make sure hosts never run out of memory. Even worse, the client uses that smoothed working set size for scheduling, which is causing all kinds of issues for OpenIFS and forcing us to use app_config instead of relying on the client to handle memory properly.
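(As an illustration of the app_config workaround: a minimal sketch that simply caps concurrency, assuming the app name is oifs_43r3_ps as the task names suggest. It goes in the climateprediction.net project directory as app_config.xml.)

<app_config>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>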

For folks interested in debugging memory usage, I would recommend installing atop or below; both will give you a historical snapshot whenever you want to check back. atop is widely available, though you might need to tune the default window to be shorter for it to be useful (see the sketch below). Also note that atop captures per-thread information, which can be a lot of data and can wear out SSDs really fast. I personally use below, which doesn't have these problems, but not many distros package it, so you might have to install rustc, then build it and install the unit files yourself. There is probably an Ubuntu PPA, though.
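(A sketch for shortening atop's recording window on Debian/Ubuntu, where the interval lives in /etc/default/atop; the variable name and path can differ by distro and atop version:)

# Record a sample every 60 s instead of the default 600 s
sudo sed -i 's/^LOGINTERVAL=.*/LOGINTERVAL=60/' /etc/default/atop
sudo systemctl restart atop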
ID: 67859
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67862 - Posted: 18 Jan 2023, 20:26:10 UTC - in response to Message 67859.  
Last modified: 18 Jan 2023, 20:26:44 UTC

For folks interested in debugging memory usage, I would recommend installing atop or below; both will give you a historical snapshot whenever you want to check back. atop is widely available, though you might need to tune the default window to be shorter for it to be useful. Also note that atop captures per-thread information, which can be a lot of data and can wear out SSDs really fast. I personally use below, which doesn't have these problems, but not many distros package it, so you might have to install rustc, then build it and install the unit files yourself. There is probably an Ubuntu PPA, though.


I tried atop and below. below does not work and I do not feel like finding out why.

Here is an excerpt from atop. Is the line starting with "MEM | tot 62.4G" the one you have in mind? I assume free plus cache is (about) the amount of RAM available, but over what interval are the values measured? From the most recent boot-up (in this case, 5d6h42m18s), or over the interval between repeats of atop? (It appears it is since system boot-up the first time, and over the current interval thereafter.)

What is shrss?
ATOP - localhost                       2023/01/18  14:54:47                       -----------------                       5d6h42m18s elapsed
PRC | sys    5h38m | user  92h04m | #proc    460 | #trun     13 | #tslpi   647 | #tslpu   180 | #zombie    0 | clones 423e3 | no  procacct |
CPU | sys      97% | user   1094% | irq       4% | idle    403% | wait      2% | steal     0% | guest     0% |              | curf 4.23GHz |
CPL | avg1   12.36 | avg5   12.43 | avg15  12.44 |              | csw 120772e4 |              | intr 62723e5 |              | numcpu    16 |
MEM | tot    62.4G | free    4.5G | cache  36.3G | dirty  46.3M | buff  142.3M | slab    1.4G | shmem  87.2M | shrss   2.0M | numnode    1 |
SWP | tot    15.6G | free   14.0G | swcac  18.7M |              |              |              |              | vmcom  27.0G | vmlim  46.8G |

ID: 67862
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,622,177
RAC: 59,768
Message 67865 - Posted: 18 Jan 2023, 21:53:47 UTC - in response to Message 67862.  

below requires cgroup v2, though that should be the default in most distros by now. Both are useful for looking at per-process stats too, like finding the peak RSS for each OpenIFS task; for that, we probably need short recording intervals. To look at history, you want to start with `atop -r <timestamp>`; otherwise the top rows aren't any more useful than top or other tools when monitoring live.
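(A usage sketch, assuming the default Debian/Ubuntu log location under /var/log/atop; file names and the -b time format vary a little between atop versions:)

# Replay yesterday's recorded log, starting near 14:00; step with 't'/'T'
atop -r /var/log/atop/atop_20230118 -b 1400
# With no file argument, -r opens today's log
atop -r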

The atop man page explains the meanings, except it doesn't cover shrss either. I guess it's small enough that we can just ignore it.
ID: 67865
alanb1951

Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 67878 - Posted: 19 Jan 2023, 0:08:00 UTC

What is shrss?
According to the man page for atop on Xubuntu 22.04, it is "the resident size of shared memory (`shrss`)" (the same as SHR in top?).

Cheers - Al.
ID: 67878
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,622,177
RAC: 59,768
Message 67879 - Posted: 19 Jan 2023, 0:25:04 UTC - in response to Message 67865.  

Turns out, if all we care about is getting the RSS of the OpenIFS app over some period, it's much faster to just write a script than to try to observe it live or from history. I'd been meaning to do this for a while, just to understand how much the memory usage swings, and I guess this discussion finally pushed me to do it.

Shitty script here: https://pastebin.com/GtAiv5XB. It takes one RSS sample per second; the total count for each bucket is in the parentheses. --help has some flags you can tune. There are probably lots of rough edges for corner cases, and it's Linux-only. Here's what I got for the current public app after running it for 5 minutes:
$ ./boinc_task_memory.py --slot 15
2023-01-18 16:17:51,760 [INFO] pid of slot 15: 495869
2568212 - 2714144: ***************** (51)
2714145 - 2860076:  (2)
2860077 - 3006008:  (0)
3006009 - 3151940:  (0)
3151941 - 3297872:  (2)
3297873 - 3443804: ****** (19)
3443805 - 3589736: * (3)
3589737 - 3735668: * (4)
3735669 - 3881600: ** (8)
3881601 - 4027532: ************************** (78)
4027533 - 4173464: *********************************** (107)
4173465 - 4319396: * (5)
4319397 - 4465328:  (2)
4465329 - 4611260: ** (6)
4611261 - 4757192: * (5)
4757193 - 4903125: ** (8)
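(For readers who don't want to fetch the pastebin, here is a minimal sketch of the same idea: sampling RSS once per second from /proc. This is an illustration, not the actual script, and the histogram bucketing is left out.)

#!/usr/bin/env python3
# Sample a process's resident set size (RSS) once per second from /proc.
import sys
import time

def rss_kb(pid):
    # The VmRSS line in /proc/<pid>/status reports resident size in kB
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

pid = int(sys.argv[1])
samples = []
try:
    while True:
        samples.append(rss_kb(pid))
        time.sleep(1)
except (KeyboardInterrupt, FileNotFoundError):
    # FileNotFoundError means the target process has exited
    pass
if samples:
    print(f"{len(samples)} samples: min {min(samples)} kB, max {max(samples)} kB")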
ID: 67879
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,963,473
RAC: 14,071
Message 67922 - Posted: 20 Jan 2023, 17:27:50 UTC

Not sure if this is the right place for this, but I have had a task fail with a compute error after the last zip file (122) was written. The stderr message is:

<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
irectory: /var/lib/boinc-client/slots/1/ICMSHhq0f+002316

WU 12189428 task 22274970.
ID: 67922
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 67995 - Posted: 23 Jan 2023, 16:22:28 UTC - in response to Message 67922.  
Last modified: 23 Jan 2023, 16:26:35 UTC

Yep, I'm aware of this one. I've seen it happen on one of my machines, and from our error stats it's responsible for less than 5% of the total failed tasks. That error message occurs because there are two models executing in the same slot directory. One of the model processes shouldn't be running, so the client kills it when the task itself is complete (because the real model process finished normally).

In detail: there are two processes involved in a task. One is the model; the other is a controlling process that monitors the model and zips & transfers results to the CPDN server. This controller process also executes the suspend/resume instructions from the client. Sometimes, for some reason, the controller loses track of the model's process id. I'm not sure why. It might be related to the memory faults that also occur: on my machine I had the 'process still running' error right after I saw a task fail with 'double free or corruption'. So my working theory is that one task clobbered a bit of another task's memory, and the controlling process then couldn't control the model any longer. The BOINC client shows the task as suspended, but it's actually only the controller process that's suspended, as the client knows nothing about the model. Only the controller does, and since it has 'lost' the model, the model runs free.

I spotted this because I suspended the project and then wondered why my PC fans were still running. I checked processes with 'ps' and noticed that one 'oifs_43r3_model.exe' process was still running; all the other processes were suspended correctly.

If you see this happen, you can safely kill the running model, as the client will just start the model up again when the task resumes (which is why a second model process starts up). Just make sure to kill the right process: only the 'oifs_43r3_model.exe' process, not the 'oifs_43r3_ps_1.05_x86_64-pc-linux-gnu' one, as that's the controller. If you kill the controller by mistake, it will abort the task. And make sure the project is suspended first, because otherwise the controller will detect that the model process was killed and will also abort the task.

I have fixed a number of issues in the controller code lately, and a new version is about to be tested. One fix was in the process control, though I am not 100% certain it will deal with this issue.

ID: 67995
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68003 - Posted: 23 Jan 2023, 17:30:53 UTC

As a P.S. to my previous message about fixing issues: the forum posts about the amount of I/O from the model were noted, and I have changed the model configuration for future batches so it will produce less logging output. I can't alter the result file sizes or the checkpoints, but the log information contributed a notable percentage of the read & write I/O.
ID: 68003
BellyNitpicker

Joined: 13 Jun 20
Posts: 6
Credit: 5,301,352
RAC: 176,529
Message 68022 - Posted: 24 Jan 2023, 22:06:46 UTC

Re previous observations on OIFS uploads failing: I now have well over a week's worth of pending uploads across two Ubuntu virtual machines, between 500 and 1,000 in all. No storage problems at my end yet, but the machines are only virtual, and each has only a fraction of a real SSD. Do we have any news on when uploads might resume?

Nick
ID: 68022
cetus

Joined: 7 Aug 04
Posts: 10
Credit: 148,003,836
RAC: 40,207
Message 68023 - Posted: 24 Jan 2023, 23:16:07 UTC - in response to Message 67995.  

It might be related to the memory faults that also occur: on my machine I had the 'process still running' error right after I saw a task fail with 'double free or corruption'.

I have also seen and killed several detached model.exe processes that seem to occur after the model fails with a "double free or corruption (out)" error. I've started looking with "ps -efl | grep boinc" whenever I see a task with a computation error. The bad process is pretty easy to find because its parent PID is set to "1" instead of the PID of a controlling process, and it has the same slot number as another process. I suspect that there is a detached process every time the corruption error happens, but I haven't looked consistently enough to be certain.
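(A sketch of that search, narrowed to orphaned model processes; verify the PID by eye before killing anything, and suspend the project first, as Glenn advises above:)

# List model processes re-parented to init (PPID 1); the [o] stops grep matching itself
ps -eo pid,ppid,args | grep '[o]ifs_43r3_model.exe' | awk '$2 == 1'
# After checking the PID:  kill <pid>
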
Do you have any insight into how the intermediate data is used? It's easy to imagine looking at final results of 40,000 runs, and it's easy to imagine looking at the intermediate results of a few runs, but I have a hard time imagining sorting through the massive amount of data that we are generating here.
ID: 68023
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,807,823
RAC: 19,824
Message 68028 - Posted: 25 Jan 2023, 7:11:36 UTC - in response to Message 68022.  

Do we have any news on when uploads might resume?

The last update posted didn't specify a number; it just said several days. My guess would be not until next week.
ID: 68028
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 68029 - Posted: 25 Jan 2023, 7:42:18 UTC - in response to Message 68028.  

Do we have any news on when uploads might resume?

The last update posted didn't specify a number just said several days. My guess would be not until next week.
That is likely. I, one of the other moderators, or Glenn will post when we hear anything. The first place to look will be the "Uploads are stuck" thread.
ID: 68029
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,807,823
RAC: 19,824
Message 68034 - Posted: 25 Jan 2023, 9:33:41 UTC

It seems that even though the sending out of work has been paused, that only applies to new work: reruns are still being sent out. Unfortunately, many if not all of them seem to be from users who have probably already completed them but haven't been able to upload and report before the deadline. It seems I was wrong in my confidence that this wouldn't happen. So now others will have to redo the work. At least this time the 30-day grace period applies.
ID: 68034
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 68036 - Posted: 25 Jan 2023, 9:39:37 UTC - in response to Message 68034.  

Most of the active crunchers will be well into "too many uploads" by now, and active readers of these boards will know the risks of trying to circumvent that limit. I think the continued issuing of resends is a very minor concern in the grand scheme of things.
ID: 68036
AndreyOR

Joined: 12 Apr 21
Posts: 317
Credit: 14,807,823
RAC: 19,824
Message 68037 - Posted: 25 Jan 2023, 9:57:00 UTC - in response to Message 68036.  

Most of the active crunchers will be well into "too many uploads" by now, and active readers of these boards will know the risks of trying to circumvent that limit. I think the continued issuing of resends is a very minor concern in the grand scheme of things.

Sure, but it'll be whoever is lucky enough to upload and report first who gets the credit. There's a good chance that many of the users who did the work first will lose out.
ID: 68037
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 68039 - Posted: 25 Jan 2023, 11:03:19 UTC

Update:
The data backup has been reduced sufficiently that the batch & upload servers will be restarted today or, failing that, tomorrow (depending on some last checks).
ID: 68039
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68048 - Posted: 25 Jan 2023, 19:45:30 UTC - in response to Message 68036.  
Last modified: 25 Jan 2023, 19:48:58 UTC

Richard Haselgrove wrote:
I think the continued issue of resends is a very minor concern in the grand scheme of things.
Except that if two results of the same workunit are uploaded, that aggravates upload11.cpdn.org's troubles.

So far this month, whenever upload11.cpdn.org was up at all, it was _never_ able to take our result data as fast as we were able to compute it. If we now start to compute redundant tasks, this will only get worse.
ID: 68048