Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Uh.... that's my point? Windows waits. So why does CPDN fail? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Uh.... that's my point? Windows waits. So why does CPDN fail?" I suspect the answer lies in the fact that these programs were written by the Met Office to run on their supercomputer(s), which were not subject to random shutdowns by either the operating system or users. If they had been written for running on personal computers, much would have been done differently. I am reminded of the story of a traveller in a remote rural location asking for directions and being told, "Well, if I really wanted to be going there, I would not be setting off from here." This is why, while I am sure there are still things to be improved with respect to this issue at least, the OIFS tasks are much better behaved. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"Uh.... that's my point? Windows waits. So why does CPDN fail?" "I suspect the answer lies in the fact that these programs were written by the Met Office to run on their supercomputer(s), which were not subject to random shutdowns by either the operating system or users. If they had been written for running on personal computers, much would have been done differently." It's true, running on home PCs is a very different environment. Though, as these models are run on thousands of nodes, it was not uncommon for a blade in a node to fail during a forecast (or part of the interconnect to fail). |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
But it's Boinc doing things wrong. It acts as an intermediary between Windows and the CPDN app. When Windows says "I'm shutting down", Boinc should tell CPDN to shut down, and then wait until CPDN says it's finished, and only then tell Windows Boinc is ready for shutdown. This should all be part of the Boinc wrapper yours and every project uses. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"But it's Boinc doing things wrong. It acts as an intermediary between Windows and the CPDN app. When Windows says 'I'm shutting down', Boinc should tell CPDN to shut down, and then wait until CPDN says it's finished, and only then tell Windows Boinc is ready for shutdown. This should all be part of the Boinc wrapper yours and every project uses." I may misunderstand something. I do not know about Windows, but in Linux, only CPDN tasks have a wrapper. The wrapper is started by the Boinc Client and does not do much. But it forks another process that does almost all the work. So on shutdown, one would expect the system to tell the boinc client to shut down, then the client would tell the wrapper, unique to CPDN, to shut down, and perhaps it does. It is then the responsibility of the wrapper to shut down the working process, since the client does not know about it. But on shutdown, it may be that the Linux kernel shuts down the working process first, and then the wrapper, if still running, does not know what to do: it is too late. I do not understand the inner workings of the boinc client, the wrapper and the working task all that well. But none of the other six projects I run work like this. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I was mistaken, I thought all projects used wrappers. But now I think it's just "virtualbox" ones, like LHC, which would mean CPDN doesn't use one. The CPDN app runs natively in whatever OS you have. But wrapper or not, surely Boinc is already set up with signals to give to the app and get back from the app, so that Boinc can tell the app to shut down and the app can tell Boinc it's ready to shut down. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"I was mistaken, I thought all projects used wrappers. But now I think it's just 'virtualbox' ones, like LHC, which would mean CPDN doesn't use one. The CPDN app runs natively in whatever OS you have." Hi Peter, it depends on what you mean by a 'wrapper'. To CPDN/boinc a wrapper is a separately running process that controls the models. CPDN *do* use wrappers, but maybe not the same kind that you're referring to. The MetO and OpenIFS models both use wrappers, though they function a little differently. CPDN are moving to the OpenIFS way of doing it, so I'll focus on that. The 'wrapper' for OpenIFS is a separate process (i.e. program) that talks directly to boinc *and* the model. The model has no knowledge it's running under boinc. The job of the wrapper is to start the model, monitor it, send output back to cpdn and kill it if the boinc client asks. We could put all this inside a virtual machine (like LHC), but this wouldn't be a 'wrapper' in software engineering terms. --- CPDN Visiting Scientist |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Put things in a virtual machine as a last resort. Virtualbox causes endless problems and slows the OS interface down, especially if multiple ones are running. LHC also hates shutdowns. Not sure if that's virtualbox or LHC's fault. But randomly they get computation errors on restart. It appears to be shutting down nicely: when I shut down Windows, it says "virtualbox is saving state" for each running task. Those quickly go away, then I'm left with "virtualbox has active connections", and nobody knows why. Someone told me it has to do with an internet connection left open to LHC, but I have no idea what I'm supposed to do about it, so I ignore it. |
Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
The researcher for the EAS tasks has discussed the results with her professor and there were some concerns about the spin up results but they have decided everything is within range and the mainsite tasks will be released, "very soon." Do we have an ETA yet? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Do we have an ETA yet?" No news since that message. I have seen "soon" mean the following day but also mean another month. Any update I give at this point would not be speaking ex cathedra. |
Send message Joined: 2 Oct 06 Posts: 54 Credit: 27,309,613 RAC: 28,128 |
"Do we have an ETA yet?" "No news since that message. I have seen 'soon' mean the following day but also mean another month. Any update I give at this point would not be speaking ex cathedra." Got it. Thanks for the update, even if the update is "no new news". |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,476,460 RAC: 15,681 |
"Do we have an ETA yet?" "No news since that message. I have seen 'soon' mean the following day but also mean another month. Any update I give at this point would not be speaking ex cathedra." "Got it. Thanks for the update, even if the update is 'no new news'." Some of the CPDN folk are on their holidays now, so I don't expect things to move very quickly. --- CPDN Visiting Scientist |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Some of the CPDN folk are on their holidays now, so I don't expect things to move very quickly." Enjoying the global warming heatwave! |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I thought all projects used wrappers. But now I think it's just 'virtualbox' ones, like LHC, which would mean CPDN doesn't use one." Well, it is pretty obvious that CPDN does use wrappers, if we are talking about the same thing. The wrapper is started by the Boinc-Client. The wrapper spins off the actual program that does most of the work. As far as I can tell, the wrapper compresses trickles and results and sends them back to the server. My impression may not be correct in the details, but that is the general idea. The Boinc-Client does not know anything about the spun-off process (the wrapped process) at all. And all this applies both to the Oifs work and the traditional work too. The only other case of wrappers that I am aware of is the SCC1 tasks on WCG, but they are very different from any others. They do not appear on the "top" list, but you can find them in the "pstree" list. That is in Linux. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Well, it is pretty obvious that CPDN does use wrappers, if we are talking about the same thing. The wrapper is started by the Boinc-Client. The wrapper spins off the actual program that does most of the work. As far as I can tell, the wrapper compresses trickles and results and sends them back to the server. My impression may not be correct in the details, but that is the general idea. The Boinc-Client does not know anything about the spun-off process (the wrapped process) at all." Ok, maybe it's an option. I read about them somewhere on the Boinc website and thought you had to use one. It must be just one method of interfacing your program to Boinc, and most projects program their apps to talk to Boinc directly. |
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
Although not related to new work, but following on from the last couple of posts: CMDock uses a wrapper and it shows under Linux. I believe that YAFU also uses a wrapper, and possibly YOYO, SRBase, TNGrid? and a few others. In some cases it is needed due to the type of programme being used or the code it has been written in. A few other projects also use a "Trickle up" method to keep the Server updated with progress (Primegrid is one), and some of these projects need a wrapper for this purpose. Conan |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
I've received some WAH Windows tasks. 7 of them are running ok, but about 20 got a computation error within minutes. Let me know if you want details, or I assume the programmers can look at the errors upon return (I've sent them back). Not sure what the cause is. My best machine is running all of its 5 correctly. The two machines that are each running 1 correctly also caused several failures. They're slower and older, but since they're managing one, I can't see it's their fault. Strange that my best machine got none wrong. |
Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
Looks like the new WAH batch only works on better machines. To establish what's wrong, here's my lot and Mikey's lot. We have both had crashes on the older machines, but the newer ones are crunching through them ok so far (mine have been running for several hours, whereas the crashes occur within a few minutes). Mine: https://www.cpdn.org/hosts_user.php?userid=2002390 Mikey's: https://www.cpdn.org/hosts_user.php?userid=1976984 Here's an error output from one of mine which crashes on a Xeon X5650 (old 12 core CPU): https://www.cpdn.org/result.php?resultid=22325797 |
Send message Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
Looks like the Intel(R) Xeon(R) CPU X5650 doesn't have AVX support. |
Send message Joined: 22 Feb 11 Posts: 32 Credit: 226,546 RAC: 4,080 |
The task should retain the previous checkpoint, and if on restart it detects that the latest checkpoint is corrupted, it should fall back to the previous one. |
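
The checkpoint fallback suggested in the last post could look roughly like the sketch below. This is only an illustration under stated assumptions, not how the CPDN models or the BOINC client actually checkpoint: the JSON file format, the SHA-256 checksum, the `.prev` and `.tmp` suffixes, and the function names `save_checkpoint` and `load_checkpoint` are all hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, path: Path) -> None:
    """Write a new checkpoint, keeping the old one as '<name>.prev' (hypothetical layout)."""
    payload = json.dumps(state, sort_keys=True)
    record = {
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload": payload,
    }
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk before renaming
    if path.exists():
        # Rotate the current checkpoint so it survives as the fallback copy.
        os.replace(path, path.with_name(path.name + ".prev"))
    os.replace(tmp, path)


def _read_valid(path: Path) -> Optional[dict]:
    """Return the stored state, or None if the file is missing or fails its checksum."""
    try:
        record = json.loads(path.read_text(encoding="utf-8"))
        payload = record["payload"]
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if digest != record["sha256"]:
            return None
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def load_checkpoint(path: Path) -> Optional[dict]:
    """Try the newest checkpoint first, then fall back to the previous one."""
    for candidate in (path, path.with_name(path.name + ".prev")):
        state = _read_valid(candidate)
        if state is not None:
            return state
    return None  # no usable checkpoint: the task would have to start from scratch
```

If the machine is switched off between the two renames in `save_checkpoint`, only the `.prev` file exists, and `load_checkpoint` still finds it. A real model would checkpoint large binary restart files rather than a small JSON dictionary, so the cost of keeping two copies on disk would need to be weighed against the cost of losing a long run.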
©2024 cpdn.org