climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68899 - Posted: 9 Jun 2023, 8:14:32 UTC - in response to Message 68898.  

Uh.... that's my point? Windows waits. So why does CPDN fail?
ID: 68899
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68900 - Posted: 9 Jun 2023, 10:29:28 UTC - in response to Message 68899.  

Uh.... that's my point? Windows waits. So why does CPDN fail?
I suspect the answer lies in the fact that these programs were written by the Met Office to run on its supercomputer(s), which were not subject to random shutdowns by either the operating system or users. If they had been written to run on personal computers, much would have been done differently.

I am reminded of the story of a traveller in a remote rural location asking for directions and being told,

"Well, if I really wanted to be going there, I would not be setting off from here." This is why the OIFS tasks are much better behaved, in respect to this issue at least, though I am sure there are still things to be improved.
ID: 68900
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 68901 - Posted: 9 Jun 2023, 10:48:09 UTC - in response to Message 68900.  

Uh.... that's my point? Windows waits. So why does CPDN fail?
I suspect the answer lies in the fact that these programs were written by the Met Office to run on its supercomputer(s), which were not subject to random shutdowns by either the operating system or users. If they had been written to run on personal computers, much would have been done differently.
It's true, running on home PCs is a very different environment. Even so, because these models run on thousands of nodes, it was not uncommon for a blade in a node (or part of the interconnect) to fail during a forecast.
ID: 68901
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68902 - Posted: 9 Jun 2023, 21:31:12 UTC

But it's Boinc doing things wrong. It acts as an intermediary between Windows and the CPDN app. When Windows says "I'm shutting down", Boinc should tell CPDN to shut down, then wait until CPDN says it's finished, and only then tell Windows that Boinc is ready for shutdown. This should all be part of the Boinc wrapper that your project and every other project uses.
ID: 68902
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68903 - Posted: 9 Jun 2023, 22:14:07 UTC - in response to Message 68902.  

But it's Boinc doing things wrong. It acts as an intermediary between Windows and the CPDN app. When Windows says "I'm shutting down", Boinc should tell CPDN to shut down, then wait until CPDN says it's finished, and only then tell Windows that Boinc is ready for shutdown. This should all be part of the Boinc wrapper that your project and every other project uses.


I may misunderstand something. I do not know about Windows, but in Linux, only CPDN tasks have a wrapper. The wrapper is started by the Boinc Client and does not do much. But it forks another process that does almost all the work.

So on shutdown, one would expect the system to tell the boinc client to shut down, then the client to tell the wrapper, unique to CPDN, to shut down, and perhaps it does. It is then the responsibility of the wrapper to shut down the working process, since the client does not know about it. But on shutdown, it may be that the Linux kernel shuts down the working process first, and then the wrapper, if still running, does not know what to do: it is too late. I do not understand the inner workings of the boinc client, the wrapper and the working task all that well. But none of the other six projects I run work like this.
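
As a rough illustration of the hand-off described above, where the wrapper rather than the client is responsible for stopping the working process, here is a minimal Python sketch. It is not the CPDN wrapper: the function name stop_worker and the grace_seconds parameter are made up for illustration, and the exit codes assume a POSIX system such as Linux.

```python
import subprocess


def stop_worker(worker: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Ask a worker process to exit, and force it only if it overstays.

    Hypothetical helper: a wrapper would call this when the client tells
    it to shut down, because the client does not know about the worker.
    """
    if worker.poll() is None:        # worker is still running
        worker.terminate()           # SIGTERM on POSIX: a polite request
        try:
            # Give the worker time to write a checkpoint and exit by itself.
            worker.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            worker.kill()            # SIGKILL: no checkpoint is possible now
            worker.wait()
    return worker.returncode
```

If something else has already stopped the worker by the time the wrapper gets its turn, poll() reports that it has exited and the function simply returns its exit status, which is the "too late" case: nothing is left for the wrapper to do.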
ID: 68903
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68904 - Posted: 9 Jun 2023, 22:40:26 UTC

I was mistaken, I thought all projects used wrappers. But now I think it's just "virtualbox" ones, like LHC, which would mean CPDN doesn't use one. The CPDN app runs natively in whatever OS you have.

But wrapper or not, surely Boinc already has signals it can send to the app to tell it to shut down, and that the app can send back to tell Boinc it's ready to shut down.
ID: 68904
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 68905 - Posted: 11 Jun 2023, 11:15:03 UTC - in response to Message 68904.  

I was mistaken, I thought all projects used wrappers. But now I think it's just "virtualbox" ones, like LHC, which would mean CPDN doesn't use one. The CPDN app runs natively in whatever OS you have.

But wrapper or not, surely Boinc already has signals it can send to the app to tell it to shut down, and that the app can send back to tell Boinc it's ready to shut down.
Hi Peter, it depends on what you mean by a 'wrapper'. To CPDN/boinc a wrapper is a separately running process that controls the models. CPDN *do* use wrappers but maybe not the same kind that you're referring to.

The MetO and OpenIFS models both use wrappers though they function a little differently. CPDN are moving to the OpenIFS way of doing it so I'll focus on that. The 'wrapper' for OpenIFS is a separate process (i.e. program) that talks directly to boinc *and* the model. The model has no knowledge it's running under boinc. The job of the wrapper is to start the model, monitor it, send output back to CPDN, and kill it if the boinc client asks.
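
For anyone who finds code easier to follow, the start / monitor / kill part of that job can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual OpenIFS wrapper: run_model is a made-up name, the threading.Event stands in for whatever the boinc client uses to ask the wrapper to quit, and sending output back to CPDN is left out.

```python
import subprocess
import threading


def run_model(cmd, quit_requested: threading.Event, poll_seconds: float = 1.0) -> int:
    """Start the model, watch it, and stop it if the client asks us to quit."""
    model = subprocess.Popen(cmd)      # the model knows nothing about boinc
    while model.poll() is None:        # monitor until the model exits by itself
        # Sleep for up to poll_seconds, waking early if a quit is requested.
        if quit_requested.wait(timeout=poll_seconds):
            model.terminate()          # the client asked, so stop the model
            model.wait()
            break
    return model.returncode            # model's exit status, for reporting
```

The point of the design is that only the wrapper holds the handle to the model process, so anything the client wants done to the model has to go through the wrapper.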

We could put all this inside a virtual machine (like LHC), but this wouldn't be a 'wrapper' in software engineering terms.
---
CPDN Visiting Scientist
ID: 68905
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68906 - Posted: 11 Jun 2023, 11:26:50 UTC - in response to Message 68905.  
Last modified: 11 Jun 2023, 11:29:25 UTC

Put things in a virtual machine as a last resort. Virtualbox causes endless problems and slows the OS interface down, especially if several VMs are running.

LHC also hates shutdowns. I'm not sure whether that's Virtualbox's fault or LHC's, but tasks randomly get computation errors on restart. It appears to shut down nicely: when I shut down Windows, it says "virtualbox is saving state" for each running task. Those messages quickly go away, and then I'm left with "virtualbox has active connections", and nobody knows why. Someone told me it has to do with an internet connection left open to LHC, but I have no idea what I'm supposed to do about it, so I ignore it.
ID: 68906
zombie67 [MM]

Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 68907 - Posted: 15 Jun 2023, 17:59:06 UTC - in response to Message 68853.  

The researcher for the EAS tasks has discussed the results with her professor and there were some concerns about the spin up results but they have decided everything is within range and the mainsite tasks will be released, "very soon."


Do we have an ETA yet?
ID: 68907
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 68908 - Posted: 15 Jun 2023, 20:17:19 UTC - in response to Message 68907.  

Do we have an ETA yet?
No news since that message. I have seen "soon" mean the following day, but also another month. Any update I give at this point would not be speaking ex-cathedra.
ID: 68908
zombie67 [MM]

Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 68909 - Posted: 16 Jun 2023, 3:06:04 UTC - in response to Message 68908.  

Do we have an ETA yet?
No news since that message. I have seen "soon" mean the following day, but also another month. Any update I give at this point would not be speaking ex-cathedra.

Got it. Thanks for the update, even if the update is "no new news".
ID: 68909
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,476,460
RAC: 15,681
Message 68910 - Posted: 16 Jun 2023, 13:17:12 UTC - in response to Message 68909.  

Do we have an ETA yet?
No news since that message. I have seen "soon" mean the following day, but also another month. Any update I give at this point would not be speaking ex-cathedra.
Got it. Thanks for the update, even if the update is "no new news".
Some of the CPDN folk are on their holidays now, so I don't expect things to move very quickly.
---
CPDN Visiting Scientist
ID: 68910
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68911 - Posted: 16 Jun 2023, 14:37:27 UTC - in response to Message 68910.  

Some of the CPDN folk are on their holidays now, so I don't expect things to move very quickly.
Enjoying the global warming heatwave!
ID: 68911
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68912 - Posted: 17 Jun 2023, 17:35:51 UTC - in response to Message 68904.  

I thought all projects used wrappers. But now I think it's just "virtualbox" ones, like LHC, which would mean CPDN doesn't use one.


Well, it is pretty obvious that CPDN does use wrappers, if we are talking about the same thing. The wrapper is started by the Boinc-Client. The wrapper spins off the actual program that does most of the work. As far as I can tell, the wrapper compresses trickles and results and sends them back to the server. My impression may not be correct in the details, but that is the general idea. The Boinc-Client does not know anything about the spun-off process (the wrapped process) at all.

And all this applies both to the Oifs work and the traditional work too.

The only other wrappers that I am aware of are those for the SCC1 tasks on WCG, but they are very different from any others. They do not appear on the "top" list, but you can find them in the "pstree" list. That is in Linux.
ID: 68912
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68913 - Posted: 17 Jun 2023, 19:54:22 UTC - in response to Message 68912.  

Well, it is pretty obvious that CPDN does use wrappers, if we are talking about the same thing. The wrapper is started by the Boinc-Client. The wrapper spins off the actual program that does most of the work. As far as I can tell, the wrapper compresses trickles and results and sends them back to the server. My impression may not be correct in the details, but that is the general idea. The Boinc-Client does not know anything about the spun-off process (the wrapped process) at all.

And all this applies both to the Oifs work and the traditional work too.

The only other wrappers that I am aware of are those for the SCC1 tasks on WCG, but they are very different from any others. They do not appear on the "top" list, but you can find them in the "pstree" list. That is in Linux.
OK, maybe it's optional. I read about them somewhere on the Boinc website and thought you had to use one. It must be just one method of interfacing your program to Boinc, and most projects program their apps to talk to it directly.
ID: 68913
Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 68914 - Posted: 18 Jun 2023, 10:13:29 UTC

Although not related to new work, but following on from the last couple of posts:
CMDock uses a wrapper, and it shows under Linux.
I believe that YAFU also uses a wrapper, and possibly YOYO, SRBase, TNGrid(?) and a few others. In some cases it is needed because of the type of programme being used or the code it has been written in.

A few other projects also use a "Trickle up" method to keep the Server updated with progress (Primegrid is one) and some of these projects need a wrapper for this purpose.

Conan
ID: 68914
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68934 - Posted: 23 Jun 2023, 20:14:14 UTC
Last modified: 23 Jun 2023, 20:14:50 UTC

I've received some WAH Windows tasks. 7 of them are running ok, but about 20 got a computation error within minutes. Let me know if you want details; otherwise I assume the programmers can look at the errors upon return (I've sent them back). Not sure what the cause is. My best machine is running all 5 of its tasks correctly. The other two machines are each running 1 correctly, but they also produced several failures. They're slower and older, but since they're each managing one, I can't see that it's their fault. Strange that my best machine got none wrong.
ID: 68934
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 68939 - Posted: 23 Jun 2023, 23:27:09 UTC
Last modified: 23 Jun 2023, 23:29:04 UTC

Looks like the new WAH batch only works on better machines. To establish what's wrong, here's my lot and Mikey's lot. We have both had crashes on the older machines, but the newer ones are crunching through them ok so far (mine have been running for several hours, the crashes occur in a few minutes).

Mine: https://www.cpdn.org/hosts_user.php?userid=2002390
Mikey's: https://www.cpdn.org/hosts_user.php?userid=1976984

Here's an error output from one of mine which crashes on a Xeon X5650 (old 12 core CPU):
https://www.cpdn.org/result.php?resultid=22325797
ID: 68939
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 68940 - Posted: 23 Jun 2023, 23:47:16 UTC - in response to Message 68939.  

Looks like the Intel(R) Xeon(R) CPU X5650 doesn't have AVX support.
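
If anyone wants to check their own Linux hosts, something along these lines should do it. It is a small sketch that assumes the usual /proc/cpuinfo layout with a "flags" line; the function names are made up, and whether missing AVX is really why these tasks fail is still only a guess.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU feature flags listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are space separated; the first core's list is enough.
            return set(value.split())
    return set()


def supports_avx(cpuinfo_text: str) -> bool:
    """True if the exact flag 'avx' is present (not just 'avx2' etc.)."""
    return "avx" in cpu_flags(cpuinfo_text)
```

To use it, read the file and pass the text in, e.g. supports_avx(open("/proc/cpuinfo").read()). On Windows there is no /proc/cpuinfo, so the CPU's spec sheet or a tool like CPU-Z is the easier route there.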
ID: 68940
kotenok2000

Joined: 22 Feb 11
Posts: 32
Credit: 226,546
RAC: 4,080
Message 68941 - Posted: 23 Jun 2023, 23:53:54 UTC - in response to Message 68891.  

The task should retain the previous checkpoint, and if on restart it detects that the latest checkpoint is corrupted, it should fall back to the previous one.
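
Roughly what I mean, as a Python sketch. This is not how the CPDN apps store checkpoints: the file names and the SHA-256 check are assumptions chosen only to show the idea of keeping two generations and verifying before use.

```python
import hashlib
import os
from pathlib import Path


def save_checkpoint(directory: Path, payload: bytes) -> None:
    """Write a new checkpoint, keeping the last good one as a fallback."""
    current = directory / "checkpoint.cur"     # hypothetical file names
    previous = directory / "checkpoint.prev"
    tmp = directory / "checkpoint.tmp"
    digest = hashlib.sha256(payload).hexdigest().encode()
    tmp.write_bytes(digest + b"\n" + payload)  # checksum line, then the data
    if current.exists():
        os.replace(current, previous)          # rotate: current becomes previous
    os.replace(tmp, current)                   # atomic rename into place


def _read_valid(path: Path):
    """Return the payload if the file exists and its checksum matches."""
    try:
        digest, _, payload = path.read_bytes().partition(b"\n")
    except OSError:
        return None
    if hashlib.sha256(payload).hexdigest().encode() != digest:
        return None                            # truncated or corrupted
    return payload


def load_checkpoint(directory: Path):
    """Prefer the newest checkpoint; fall back to the previous one."""
    for name in ("checkpoint.cur", "checkpoint.prev"):
        payload = _read_valid(directory / name)
        if payload is not None:
            return payload
    return None                                # nothing usable: start over
```

The cost is disk space for one extra checkpoint, which matters for models with large restart files, so it would be a trade-off for the project to weigh.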
ID: 68941
