Thread 'OpenIFS Discussion'

Author	Message
Richard Haselgrove Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 68141 - Posted: 31 Jan 2023, 9:37:58 UTC - in response to Message 68140.
Quote: Any idea as to how close things are ...
Look at the server status page. We're down to 4182, but unfortunately, with the administrative trickle display not functioning, we can't see what timestep any of them have reached. They might be plodding along slowly, they might have finished and just be waiting for the upload server to come back, or they might have been abandoned at the starting gate. The project could see that data, but I suspect they're as much in the dark as we are.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68145 - Posted: 31 Jan 2023, 9:51:33 UTC - in response to Message 68140.
Quote: Forthcoming batches: Just out of a meeting this morning 30/1/23. There will be some 6500 workunits coming for the OpenIFS Baroclinic Lifecycle app (oifs_43r3_bl) for an experiment run by the University of Helsinki, hopefully in 2 weeks' time. ....
Quote: Aren't there still at least 12000 new tasks to be processed from the current run by the end of February? I believe that was the number when sending out of new work was turned off a week or so ago. Any idea as to how close things are for it to be turned back on?
Pretty certain it did get turned back on. 45K tasks have now gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back online.

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 68146 - Posted: 31 Jan 2023, 10:01:49 UTC - in response to Message 68145.
Dave Jackson wrote: Pretty certain it did get turned back on. 45K tasks have now gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back online.
Wow, I must have missed it. I thought there were some thousands Unsent when things got turned off and I don't remember seeing them reappear. 45k unique tasks, not including reruns?

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68147 - Posted: 31 Jan 2023, 10:40:26 UTC - in response to Message 68146.
Dave Jackson wrote: Pretty certain it did get turned back on. 45K tasks have now gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back online.
AndreyOR wrote: Wow, I must have missed it. I thought there were some thousands Unsent when things got turned off and I don't remember seeing them reappear. 45k unique tasks, not including reruns?
Yes, they did get turned back on. Andy restarted the batches last week and they got sucked up pretty quickly. It's just resends going out now, but I believe the scientist needs to rerun some non-returns for 2021. I was just looking at the batch stats page. All the 2021 batches (3125 total) have returned 90% or better so far; I estimated <5% were lost due to inappropriate perturbations causing model crashes. All the other hindcast years (1000 wus each), 2020-1981, have returned better than 80% with ~10% still in progress. So approx. 35,000 successfully completed model runs (it would be great to see a map of where all the machines were that ran those; I'll see if I can put one together). The total no. of runs is higher but that's harder to work out as the failure rate varied a fair bit. The earlier batches did better than the later ones because of the server uploads. Handwaving, let's say 60% of tasks were sent out again for one reason or another.
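
The "approx. 35,000" figure can be checked from the numbers quoted in the post. This is only a back-of-envelope sketch: it assumes "3125 total" means 3125 workunits for 2021, and it takes the stated return rates ("90% or better", "better than 80%") as lower bounds. The function and constant names are made up for illustration.

```python
# Rough lower-bound estimate of completed runs, using only figures from the post.
# Assumption: "3125 total" is the number of 2021 workunits.
HINDCAST_YEARS = range(1981, 2021)   # 1981..2020 inclusive, i.e. 40 years
WUS_PER_HINDCAST_YEAR = 1000         # "1000 wus each"
HINDCAST_RETURN_RATE = 0.80          # "better than 80%"
WUS_2021 = 3125
RETURN_RATE_2021 = 0.90              # "90% or better"


def estimate_completed():
    """Return the estimated number of successfully returned workunits."""
    hindcast = len(HINDCAST_YEARS) * WUS_PER_HINDCAST_YEAR * HINDCAST_RETURN_RATE
    latest = WUS_2021 * RETURN_RATE_2021
    return hindcast + latest


if __name__ == "__main__":
    print(f"Estimated completed runs: about {estimate_completed():,.0f}")
```

With those inputs the estimate comes out a little under 35,000, which is consistent with the rounded figure in the post.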

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68149 - Posted: 31 Jan 2023, 12:07:56 UTC
p.s. CPDN are double checking all the batches were re-enabled.

xii5ku Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085	Message 68163 - Posted: 31 Jan 2023, 20:24:25 UTC - in response to Message 68135. Last modified: 31 Jan 2023, 20:25:27 UTC
Glenn Carver wrote: It's not possible to 'filter-out' the triple-errors (if I understand what you mean).
I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which indicate either reproducible model errors or non-reproducible 'operational' failures (such as suspend-resume related issues if they are still relevant after the upcoming application updates, OOM, out of disk space, empty stderr.txt, etc.).
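
The keyword idea above could look roughly like the sketch below. It is not the project's tool: the patterns are hypothetical placeholders, since the thread does not show what the OpenIFS stderr.txt actually contains, and the category names are invented for illustration.

```python
import re

# Hypothetical patterns only; the real stderr.txt wording is not given in the thread.
OPERATIONAL_PATTERNS = {
    "out_of_memory": re.compile(r"out of memory|cannot allocate memory", re.I),
    "disk_full": re.compile(r"no space left on device", re.I),
}
MODEL_PATTERNS = {
    "model_abort": re.compile(r"model abort", re.I),  # placeholder wording
}


def classify_stderr(text):
    """Label one task's stderr text as operational, model, or unknown."""
    if not text.strip():
        return "operational:empty_stderr"
    for label, pattern in OPERATIONAL_PATTERNS.items():
        if pattern.search(text):
            return f"operational:{label}"
    for label, pattern in MODEL_PATTERNS.items():
        if pattern.search(text):
            return f"model:{label}"
    return "unknown"
```

Operational patterns are checked first so that a task which ran out of disk space is not mistaken for a model failure, even if the model also printed an abort message on the way down.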

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68164 - Posted: 31 Jan 2023, 20:54:16 UTC - in response to Message 68163.
Glenn Carver wrote: It's not possible to 'filter-out' the triple-errors (if I understand what you mean).
xii5ku wrote: I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which indicate either reproducible model errors or non-reproducible 'operational' failures (such as suspend-resume related issues if they are still relevant after the upcoming application updates, OOM, out of disk space, empty stderr.txt, etc.).
Ok. Yes, we do parse the task fails; there is a Python tool that scans the return database for known issues and produces a nice batch analysis (if it wasn't so difficult to attach an image to a forum post I'd put a copy here). But as I said in the earlier post, there is no guarantee the repeat task will suffer the same fate on a different machine, even if it looks like it might be reproducible after 1 complete task in a workunit. It's only after 3 identical task fails in a workunit that we conclude it's an inappropriate perturbation issue - by identical fails I mean the model fails at the same (or very near) timestep in each case.
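
The three-fails rule described above can be written down as a small check. This is a sketch under stated assumptions, not the CPDN tool: the function name is invented, and the `tolerance` of one timestep is a guess at what "very near" means, which the post does not quantify.

```python
def looks_like_perturbation_failure(fail_steps, tolerance=1, required_fails=3):
    """Apply the 'three fails at the same (or very near) timestep' rule.

    fail_steps: the timestep at which each failed task in a workunit stopped,
                or None where the timestep could not be determined.
    tolerance:  assumed maximum spread in timesteps that still counts as
                'very near'; the thread does not give a number.
    """
    if len(fail_steps) < required_fails:
        return False  # fewer than three fails: no conclusion yet
    if any(step is None for step in fail_steps):
        return False  # cannot compare tasks with unknown failure points
    return max(fail_steps) - min(fail_steps) <= tolerance
```

For example, fails at timesteps 120, 120 and 121 would be flagged, while fails at 120, 300 and 121 would not, because a scattered failure point suggests a machine-side problem rather than the perturbation.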

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 68183 - Posted: 2 Feb 2023, 10:18:24 UTC
Just discovered by chance that I have 3 instances of OIFS processes still running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure, but I also wonder if that's the reason I had an increase in failure rate recently on this PC. Shutting down the BOINC client didn't end them. I couldn't kill them via htop, although I'm not sure if I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68189 - Posted: 2 Feb 2023, 14:07:35 UTC - in response to Message 68183. Last modified: 2 Feb 2023, 14:25:50 UTC
AndreyOR wrote: Just discovered by chance that I have 3 instances of OIFS processes still running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure, but I also wonder if that's the reason I had an increase in failure rate recently on this PC. Shutting down the BOINC client didn't end them. I couldn't kill them via htop, although I'm not sure if I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them.
Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id very carefully!). If that doesn't do it, that's also interesting because usually an unkillable stuck process is waiting for a device. I'm not sure if that's why you might be getting more errors. It shouldn't be, because each of those processes only manages its own files. However, I can't be sure. Sometimes a reboot is a good thing; it completely clears the memory. I have a possible explanation for this which has been fixed in the latest code, though I can't be certain I've eliminated it until we do a bigger test. p.s. process needs to be running and not 'suspended' for the kill.
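
One way to "check the process id very carefully" is to look at the command line of the PID before sending the signal. The sketch below assumes a Linux /proc filesystem (which WSL2 provides). The name fragment "oifs" is a hypothetical placeholder; the thread does not give the actual executable name, so substitute whatever htop shows.

```python
import os
import signal
from pathlib import Path


def cmdline_args(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into strings."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def is_expected_process(raw, expected_fragment):
    """True if the executable name contains the expected fragment."""
    args = cmdline_args(raw)
    return bool(args) and expected_fragment in os.path.basename(args[0])


def kill_if_expected(pid, expected_fragment="oifs"):  # "oifs" is a hypothetical fragment
    """Send SIGKILL (the signal behind 'kill -9') only if the PID looks right."""
    raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    if not is_expected_process(raw, expected_fragment):
        return False
    os.kill(pid, signal.SIGKILL)  # needs the same user or root, as with 'sudo kill -9'
    return True
```

The check and the kill are separate functions so the matching logic can be tried on a copied command line without any risk of killing the wrong process.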

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 68192 - Posted: 3 Feb 2023, 7:35:34 UTC - in response to Message 68189.
Glenn Carver wrote: Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id very carefully!). If that doesn't do it, that's also interesting because usually an unkillable stuck process is waiting for a device.
I don't think I've ever had to kill a process, so I don't know the different ways to do it. I saw that the htop utility has a Kill option, so I tried it by selecting both the 9 SIGKILL and 15 SIGTERM signals, and neither worked. Not sure how different that is from the command line you mention. I'm not sure how to tell if the process is running or suspended. These weren't showing up in BOINC so they weren't suspended in that way. I noticed them by looking in htop for another reason.
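
On the question of how to tell whether a process is running or suspended: on Linux (including WSL2) the kernel reports a one-letter state for every process in /proc/<pid>/stat. The sketch below reads that field. The helper names are invented for illustration, and the state descriptions follow the standard proc(5) codes.

```python
from pathlib import Path

# One-letter process states as documented in proc(5).
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped (suspended by a signal)",
    "t": "tracing stop (held by a debugger)",
    "Z": "zombie",
    "I": "idle kernel thread",
}


def parse_state(stat_line):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The second field (the command name, in parentheses) may itself contain
    spaces or ')', so split on the LAST ')' before reading the state.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Return (state letter, description) for a live process."""
    code = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return code, STATE_NAMES.get(code, "other")
```

A state of "D" would fit the "waiting for a device" case mentioned earlier in the thread, while "T" would indicate a process that has been stopped.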

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68193 - Posted: 3 Feb 2023, 10:15:36 UTC - in response to Message 68192.
Glenn Carver wrote: Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id very carefully!). If that doesn't do it, that's also interesting because usually an unkillable stuck process is waiting for a device.
AndreyOR wrote: I don't think I've ever had to kill a process, so I don't know the different ways to do it. I saw that the htop utility has a Kill option, so I tried it by selecting both the 9 SIGKILL and 15 SIGTERM signals, and neither worked. Not sure how different that is from the command line you mention. I'm not sure how to tell if the process is running or suspended. These weren't showing up in BOINC so they weren't suspended in that way. I noticed them by looking in htop for another reason.
Ok. htop 'kill' will do the same thing as 'kill' on the terminal. There is no difference. A 'kill -9' is 'kill with extreme prejudice'. It's a signal the process cannot ignore. Richard and I exchanged some messages on this (he's seen it all before!). Looking at the BOINC client code, there is a note in there that LHC have also seen this issue. So (a) the issue has obviously been around a long time, (b) LHC also run large memory jobs, so that might provide a clue. The only thing I can see from quickly looking at the code is that the client writes a 'boinc_finish' file (I have no idea why it feels the need to do that), and there's a timer involved. I wonder if a previous memory corruption has disrupted either of those two. The good news is that of all of our task failures from these batches, this 'process still running after 5mins' problem only accounts for 2.5% of those fails. That's also the bad news, as it'll be hard to track down because it's not reproducible and doesn't happen very often. To make any progress a traceback would be ideal.
I had hoped the kill would generate one in the stderr but that didn't work. One way would be to attach the 'gdb' debugger to the process and generate a call tree. If anyone knows their way around gdb let me know and I'll send details on how to proceed, though we'll need to have some new batches first. Thanks for highlighting this.
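
For anyone wanting to try the gdb route in the meantime, a generic way to get a call tree from a stuck process is to attach gdb in batch mode and ask for a backtrace of every thread. This is only a sketch using standard gdb options (`-p`, `-batch`, `-ex`); it is not the procedure the project intends to send out, and the function names are invented for illustration.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a non-interactive gdb command that prints all thread backtraces."""
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid, timeout=60):
    """Run gdb against a live process and return whatever it printed.

    Attaching normally requires the same user or root, and gdb must be
    installed; the backtrace is only readable if the binary has symbols.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Batch mode makes gdb detach and exit after printing, so the stuck process is left as it was and can still be inspected or killed afterwards.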

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68233 - Posted: 9 Feb 2023, 14:57:24 UTC
Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.

Yeti Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121	Message 68234 - Posted: 9 Feb 2023, 15:07:21 UTC - in response to Message 68233.
Glenn Carver wrote: Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
Can you please fill in the missing details:
OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk
OpenIFS_BL: ??? GB RAM, ??? GB hard disk
Supporting BOINC, a great concept!

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68235 - Posted: 9 Feb 2023, 15:21:22 UTC - in response to Message 68233. Last modified: 9 Feb 2023, 15:44:46 UTC
Glenn Carver wrote: Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
Interestingly, I have 5 tasks from batch 990. I see the closed batches have gone from the batch statistics page. Also one from 952 running. As that batch is now closed, should I abort? (If an answer doesn't come within the next couple of hours, the answer will be academic.)
Edit: All perturbed surface.
Edit2: I see I have a message from the Moderators email list. 990 is the batch of reruns.

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 68241 - Posted: 10 Feb 2023, 8:33:03 UTC - in response to Message 68234.
Glenn Carver wrote: Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
Yeti wrote: Can you please fill in the missing details: OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk; OpenIFS_BL: ??? GB RAM, ??? GB hard disk
From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter. Memory requirement will be the same as model resolution is unchanged."

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68243 - Posted: 10 Feb 2023, 12:50:35 UTC
Glenn Carver wrote: Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
35 of the 143 missing-forecast tasks have now succeeded. I have three more which should all complete later today.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 68245 - Posted: 10 Feb 2023, 13:37:51 UTC - in response to Message 68241.
AndreyOR wrote: From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter."
I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.
22306912 12206763 9 Feb 2023, 15:24:41 UTC 10 Feb 2023, 6:24:03 UTC Completed 53,921.10 53,055.71 0.00 OpenIFS 43r3 Perturbed Surface v1.09 x86_64-pc-linux-gnu
22306615 12204746 7 Feb 2023, 22:23:59 UTC 8 Feb 2023, 14:24:05 UTC Completed 55,839.56 55,008.81 0.00 OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu
OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed: 223
Max tasks per day: 227
Number of tasks today: 0
Consecutive valid tasks: 223
Average processing rate: 28.23 GFLOPS
Average turnaround time: 3.32 days
OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu
Number of tasks completed: 1
Max tasks per day: 5
Number of tasks today: 0
Consecutive valid tasks: 1
Average processing rate: 29.20 GFLOPS
Average turnaround time: 0.62 days
So yes, MMDV: it is shorter, but not very much. My machine predicted that v1.05 tasks would take a little over 15 hours, and that is what they did. It predicted that the v1.09 job would take a few hours more than two days, but it was about the same as the v1.05 jobs. The poor turnaround time for the v1.05 jobs was due to the upload problem during the time I was running those.
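
To put a number on "shorter, but not very much", the two task rows above can be compared directly. The sketch assumes the first figure after "Completed" in each row is the run time in seconds, as in the usual BOINC task list; the helper names are made up for illustration.

```python
# Run times taken from the two task rows in the post above.
# Assumption: the first number after "Completed" is run time in seconds.
RUN_TIME_V109_S = 53921.10   # task 22306912, app v1.09
RUN_TIME_V105_S = 55839.56   # task 22306615, app v1.05


def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600.0


def percent_shorter(new_seconds, old_seconds):
    """How much shorter the new run is, as a percentage of the old run."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


if __name__ == "__main__":
    print(f"v1.09: {hours(RUN_TIME_V109_S):.2f} h")
    print(f"v1.05: {hours(RUN_TIME_V105_S):.2f} h")
    print(f"v1.09 shorter by {percent_shorter(RUN_TIME_V109_S, RUN_TIME_V105_S):.1f}%")
```

Both runs come out at roughly 15 hours and differ by only a few percent, which with a single v1.09 task is well within what normal run-to-run variation could explain.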

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4571 Credit: 19,039,635 RAC: 18,944	Message 68246 - Posted: 10 Feb 2023, 13:43:51 UTC
Jean-David Beyer wrote: I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.
#990 is the batch of 143 reruns from the previous batches.

AndreyOR Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207	Message 68249 - Posted: 10 Feb 2023, 22:30:49 UTC - in response to Message 68246.
Jean-David Beyer wrote: I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.
Dave Jackson wrote: #990 is the batch of 143 reruns from the previous batches.
Yes, it's the same PS run, finishing up some missing models, not the announced BL run. The difference is that a new app version is being used for those, 1.09 vs. 1.05. From such a small sample size, I'm not sure one can say that 1.09 is faster than 1.05. Glenn just said that the upcoming BL run will have shorter run times, which I think is due to the design of this BL run, not the newer app version. Initially the BOINC estimated run time is off, likely because BOINC has no data yet for the new app version.

Glenn Carver Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160	Message 68250 - Posted: 10 Feb 2023, 23:18:04 UTC - in response to Message 68249.
The new versions are because I've fixed various issues. Still getting a few memory corruption fails though, which I'm still working on.