Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,713,248 RAC: 8,607 |
Quote: "Any idea as to how close things are ..." Look at the server status page. We're down to 4182, but unfortunately, with the administrative trickle display not functioning, we can't see what timestep any of them have reached. They might be plodding along slowly, they might have finished and just be waiting for the upload server to come back, or they might have been abandoned at the starting gate. The project could see that data, but I suspect they're as much in the dark as we are. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,036,322 RAC: 19,542 |
Forthcoming batches Pretty certain it did get turned back on. There have been now 45K tasks gone out unless I have miscalculated and now less than a tenth of them to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back on line. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,866,635 RAC: 19,331 |
Quote: "Pretty certain it did get turned back on. There have been now 45K tasks gone out unless I have miscalculated and now less than a tenth of them to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back on line." Wow, I must have missed it. I thought there were some thousands Unsent when things got turned off and I don't remember seeing them reappear. 45k unique tasks, not including reruns? |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
Quote: "Pretty certain it did get turned back on. There have been now 45K tasks gone out unless I have miscalculated and now less than a tenth of them to come back in. I have five completed tasks waiting to upload or finish uploading once the server comes back on line." Yes, they did get turned back on. Andy restarted the batches last week and they got sucked up pretty quickly. It's just resends going out now, but I believe the scientist needs to rerun some non-returns for 2021. I was just looking at the batch stats page. All the 2021 batches (3125 total) have returned 90% or better so far; I estimated <5% were lost due to inappropriate perturbations causing model crashes. All the other hindcast years (1000 wus each), 2020-1981, have returned better than 80% with ~10% still in progress. So approx. 35,000 successfully completed model runs (it would be great to see a map of where all the machines that ran those were; I'll see if I can put one together). The total no. of runs is higher, but that's harder to work out as the failure rate varied a fair bit. The earlier batches did better than the later ones because of the server upload problems. Handwaving, let's say 60% of tasks were sent out again for one reason or another. |
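As a rough check of the figures in the post above, the arithmetic can be sketched as follows. This is only a back-of-envelope illustration: it assumes "3125 total" means 3125 workunits for 2021, that 2020-1981 covers 40 hindcast years of 1000 workunits each, and it treats the quoted lower bounds (90% and 80%) as if they were exact return rates. The variable names are made up for the example.

```python
# Back-of-envelope check of the completed-run estimate.
# Assumptions (not confirmed by the project): 3125 workunits for 2021,
# 40 hindcast years (1981-2020) of 1000 workunits each, and the quoted
# lower bounds used as if they were exact return rates.
WUS_2021 = 3125
HINDCAST_YEARS = 2020 - 1981 + 1      # inclusive count of years
WUS_PER_HINDCAST_YEAR = 1000

completed_2021 = WUS_2021 * 0.90
completed_hindcast = HINDCAST_YEARS * WUS_PER_HINDCAST_YEAR * 0.80

total_workunits = WUS_2021 + HINDCAST_YEARS * WUS_PER_HINDCAST_YEAR
completed_total = completed_2021 + completed_hindcast

print(f"workunits issued:   {total_workunits}")
print(f"completed (approx): {completed_total:.0f}")
```

Under those assumptions the total comes out a little under 35,000, in line with the estimate in the post.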
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
p.s. CPDN are double checking all the batches were re-enabled. |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,304,435 RAC: 83 |
Glenn Carver wrote: "It's not possible to 'filter-out' the triple-errors (if I understand what you mean)." I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which indicate either reproducible model errors or non-reproducible 'operational' failures (such as suspend-resume related issues if they are still relevant after the upcoming application updates, OOM, out of disk space, empty stderr.txt, etc.). |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
Quote: "Glenn Carver wrote: It's not possible to 'filter-out' the triple-errors (if I understand what you mean). I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which either indicate reproducible model errors, versus non-reproducible 'operational' failures (such as suspend-resume related issues if they are still relevant after the upcoming application updates, OOM, out of disk space, empty stderr.txt, etc.)." Ok. Yes, we do parse the task fails; there is a Python tool that scans the return database for known issues and produces a nice batch analysis (if it wasn't so difficult to attach an image to a forum post I'd put a copy here). But as I said in the earlier post, there is no guarantee the repeat task will suffer the same fate on a different machine, even if it looks like it might be reproducible after 1 complete task in a workunit. It's only after 3 identical task fails in a workunit that we conclude it's an inappropriate perturbation issue. By identical fails I mean the model fails at the same (or very near) timestep in each case. |
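The "3 identical fails" rule described above could be sketched roughly as below. This is not the project's actual Python tool, just an illustration of the idea: the function name, the `tolerance` value for what counts as a "very near" timestep, and the use of `None` for a task whose timestep could not be recovered are all assumptions.

```python
from typing import Optional, Sequence


def looks_like_bad_perturbation(
    fail_steps: Sequence[Optional[int]],
    required_fails: int = 3,
    tolerance: int = 2,  # hypothetical: how close counts as "very near"
) -> bool:
    """Return True if enough tasks failed at (nearly) the same timestep.

    fail_steps holds the last timestep reached by each failed task in a
    workunit, or None when no timestep could be recovered (for example
    from an empty stderr.txt).
    """
    known = [step for step in fail_steps if step is not None]
    if len(known) < required_fails:
        # Too few comparable fails to call it reproducible.
        return False
    # All recovered timesteps must lie within the tolerance window.
    return max(known) - min(known) <= tolerance


if __name__ == "__main__":
    print(looks_like_bad_perturbation([120, 120, 121]))
    print(looks_like_bad_perturbation([120, None, 121]))
```

A workunit with one or two fails, or with fails scattered across different timesteps, is left as a possible 'operational' failure rather than being blamed on the perturbation.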
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,866,635 RAC: 19,331 |
Just discovered by chance that I still have 3 instances of OIFS processes still running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure but I also wonder if that's the reason I had an increase in failure rate recently on this PC. Shutting down BOINC client still didn't end them. Couldn't kill them via htop although I'm not sure if I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
Quote: "Just discovered by chance that I still have 3 instances of OIFS processes still running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure but I also wonder if that's the reason I had an increase in failure rate recently on this PC. Shutting down BOINC client still didn't end them. Couldn't kill them via htop although I'm not sure if I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them." Yes, I know about these. You don't need to restart WSL2, just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting because usually an unkillable stuck process is waiting for a device. I'm not sure if that's why you might be getting more errors. It shouldn't be, because each of those processes only manages its own files. However, I can't be sure. Sometimes a reboot is a good thing; it completely clears the memory. I have a possible explanation for this which has been fixed in the latest code, though I can't be certain I've eliminated it until we do a bigger test. p.s. The process needs to be running and not 'suspended' for the kill. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,866,635 RAC: 19,331 |
Quote: "Yes, I know about these. You don't need to restart WSL2, just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting because usually an unkillable stuck process is waiting for a device." I don't think I've ever had to kill a process so I don't know the different ways to do it. I saw that in the htop utility there's a Kill option, so I tried it by selecting both the 9 SIGKILL and 15 SIGTERM signals, and neither worked. Not sure how different that is from the command line you mention. I'm not sure how to tell if the process is running or suspended. These weren't showing up in BOINC so they weren't suspended in that way. I noticed them by looking in htop for another reason. |
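On the question of how to tell whether a process is running or suspended: on Linux (including WSL2) the kernel reports a one-letter state for every process in `/proc/<pid>/stat`, and htop shows the same letter in its S column. The sketch below reads that state; the helper names are made up for illustration, and the PID is whatever htop shows for the stuck process.

```python
from pathlib import Path

# One-letter process states as reported by the Linux kernel.
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped (e.g. by SIGSTOP)",
    "t": "stopped by a tracer",
    "Z": "zombie",
    "I": "idle kernel thread",
}


def parse_state(stat_line: str) -> str:
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain
    # spaces or ')', so take whatever follows the last ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Return a readable state for a PID (Linux only)."""
    code = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return f"{code}: {STATE_NAMES.get(code, 'unknown')}"
```

A process shown as `D` is in uninterruptible sleep and will not act on any signal, `kill -9` included, until the wait ends, which fits the "waiting for a device" case mentioned in the quoted post. A `T` means the process is stopped rather than running.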
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
Quote: "Yes, I know about these. You don't need to restart WSL2, just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting because usually an unkillable stuck process is waiting for a device." Ok. htop 'kill' will do the same thing as 'kill' on the terminal. There is no difference. A 'kill -9' is 'kill with extreme prejudice'. It's a signal the process cannot ignore. Richard and I exchanged some messages on this (he's seen it all before!). Looking at the BOINC client code, there is a note in there that LHC have also seen this issue. So (a) the issue has obviously been around a long time, and (b) LHC also run large memory jobs, so that might provide a clue. The only thing I can see from quickly looking at the code is that the client writes a 'boinc_finish' file (I have no idea why it feels the need to do that), and there's a timer involved. I wonder if a previous memory corruption has disrupted either of those two. The good news is that of all of our task failures from these batches, this 'process still running after 5mins' problem only accounts for 2.5% of those fails. That's also the bad news, as it'll be hard to track down because it's not reproducible and doesn't happen very often. To make any progress a traceback would be ideal. I had hoped the kill would generate one in the stderr but that didn't work. One way would be to attach the 'gdb' debugger to the process and generate a call tree. If anyone knows their way around gdb let me know and I'll send details on how to proceed, though we'll need to have some new batches first. Thanks for highlighting this. |
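For anyone curious what attaching gdb to a stuck process might look like, here is a minimal sketch. It is not the procedure Glenn refers to (those details are not in this thread); it only uses standard gdb options (`-p` to attach to a PID, `-batch` to exit when done, `-ex` to run a command), and the helper names are invented for the example. Attaching usually needs gdb installed and may need root, depending on the system's ptrace settings.

```python
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a gdb command line that attaches and dumps all backtraces."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # exit once the commands have run
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def dump_backtrace(pid: int) -> str:
    """Run gdb against a live process and return whatever it printed."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # report gdb's own error text instead of raising
    )
    return result.stdout + result.stderr
```

Calling `dump_backtrace(<process id>)` on one of the leftover processes would return the call stacks as text, which could then be pasted into a forum post. Without debug symbols the output will mostly be raw addresses.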
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app. |
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,830,697 RAC: 44,803 |
Quote: "Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app." Can you please fill in the missing details? OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk. OpenIFS_BL: ??? GB RAM, ??? GB hard disk. Supporting BOINC, a great concept! |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,036,322 RAC: 19,542 |
Quote: "Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app." Interestingly, I have 5 tasks from batch 990. I see the closed batches have gone from the batch statistics page. Also one from 952 running. As that batch is now closed, should I abort? (If an answer doesn't come within the next couple of hours, it will be academic.) Edit: All perturbed surface. Edit2: I see I have a message from the Moderators email list. 990 is the batch of reruns. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,866,635 RAC: 19,331 |
Quote: "Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app." From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter. Memory requirement will be the same as model resolution is unchanged." |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,036,322 RAC: 19,542 |
Quote: "Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app." 35 of the 143 missing-forecast tasks have now succeeded. I have three more, which should all complete later today. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Quote: From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter." I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988. Task 22306912 (work unit 12206763): sent 9 Feb 2023, 15:24:41 UTC; reported 10 Feb 2023, 6:24:03 UTC; Completed; run time 53,921.10 s; CPU time 53,055.71 s; credit 0.00; OpenIFS 43r3 Perturbed Surface v1.09 x86_64-pc-linux-gnu. Task 22306615 (work unit 12204746): sent 7 Feb 2023, 22:23:59 UTC; reported 8 Feb 2023, 14:24:05 UTC; Completed; run time 55,839.56 s; CPU time 55,008.81 s; credit 0.00; OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu. Application details, OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu: number of tasks completed 223; max tasks per day 227; number of tasks today 0; consecutive valid tasks 223; average processing rate 28.23 GFLOPS; average turnaround time 3.32 days. Application details, OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu: number of tasks completed 1; max tasks per day 5; number of tasks today 0; consecutive valid tasks 1; average processing rate 29.20 GFLOPS; average turnaround time 0.62 days. So yes, MMDV: it is shorter, but not very much. My machine predicted that v1.05 tasks would take a little over 15 hours, and that is what they did. It predicted that the v1.09 job would take a few hours more than two days, but it was about the same as the v1.05 jobs. The poor turnaround time for the v1.05 jobs was due to the upload problem during the time I was running those. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,036,322 RAC: 19,542 |
Quote: "I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988." #990 is the batch of 143 reruns from the previous batches. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,866,635 RAC: 19,331 |
Quote: "#990 is the batch of 143 reruns from the previous batches." Yes, it's the same PS run, finishing up some missing models, not the announced BL run. The difference is that a new app version is being used for those, 1.09 vs. 1.05. From such a small sample size, I'm not sure one can say that 1.09 is faster than 1.05. Glenn just said that the upcoming BL run will have shorter run times, which I think is due to the design of this BL run, not the newer app version. Initially the BOINC estimated run time is off, likely because BOINC has no data yet for the new app version. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,461,507 RAC: 15,627 |
The new versions are because I've fixed various issues. Still getting a few memory corruption fails though which I'm still working on. |
©2024 cpdn.org