climateprediction.net (CPDN) home page
Thread 'New work discussion - 2'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66546 - Posted: 23 Nov 2022, 9:18:28 UTC

But not automatically within Boinc yet presumably.
Correct.
In projects such as asteroids, there are volunteers helping code OpenCL for GPUs for example. Have you obtained the services of us users for here too? (Not myself, but I'm sure there are many).
For getting Open Box to run automagically in a shorter time frame, if volunteers were to come forward with the requisite skills, I suspect they would be welcomed. Volunteers don't get around the license conditions of the code used, again something that has been said many times.
ID: 66546
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66547 - Posted: 23 Nov 2022, 11:12:07 UTC - in response to Message 66546.  

For getting Open Box to run automagically in a shorter time frame, if volunteers were to come forward with the requisite skills, I suspect they would be welcomed. Volunteers don't get around the license conditions of the code used, again something that has been said many times.
Not sure what you mean by that. Your first sentence sounds like yes; your second sounds like they wouldn't be allowed.
ID: 66547
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66548 - Posted: 23 Nov 2022, 11:32:52 UTC - in response to Message 66546.  
Last modified: 23 Nov 2022, 11:55:45 UTC

Volunteers don't get around the license conditions of the code used, again something that has been said many times.
There would have to be a formal association with U.Oxford in some form, then volunteers would be covered by the university licensing (as I am), at least for the model software. The boinc control code for the models is all open source.

But what we're talking about here is implementation of binary executables in a new container environment, that doesn't need a source code license. The binary that CPDN volunteers run is covered by a 'binary only' license (if you ever look in the slot directory, you will see the PDF).
ID: 66548
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66550 - Posted: 23 Nov 2022, 11:38:55 UTC

Final dev tests are running now for the Perturbed Surface (ps) variant of Linux OpenIFS (Dave has 3 & I've got 1). All looks ok. Currently preparing the files for the big release on the production site.

Plan is to release batches of ~1000 workunits each starting next week. There will be a total of 42 of these batches to go out, with the possibility of another 40-odd if the scientist wants.

The Baroclinic Lifecycle (bl) variant of the model, under a separate project, is also nearing final testing. That will not be releasing as many tasks as the ps one.

I've created a new thread 'OpenIFS models discussion' under Number Crunching to spare this thread from discussion about the models now they will be appearing in large numbers. There is also the OpenIFS FAQ which I will keep updated periodically and I hope moderators can direct questions to.
ID: 66550
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66551 - Posted: 23 Nov 2022, 12:04:05 UTC - in response to Message 66550.  
Last modified: 23 Nov 2022, 12:07:54 UTC

In fact, I seem to have four running. Total memory use on my system seems to peak just below 21GB out of 32 with four so that is fine. Running 8 at once should probably wait till I have my upgrade to 64GB though and even then, restricting my box to 7 out of 8 real cores may well prove to give a higher throughput.
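As a rough check on those memory figures, here is a minimal sketch of the arithmetic. It assumes memory use scales roughly linearly with the number of tasks and that all tasks peak at the same time, which has not been measured here. The function names and the 4 GB reserve for the rest of the system are hypothetical.

```python
def peak_gb_per_task(total_peak_gb: float, n_tasks: int) -> float:
    """Average peak memory per task, assuming all tasks peak together."""
    return total_peak_gb / n_tasks


def max_tasks(ram_gb: float, per_task_gb: float, reserve_gb: float = 4.0) -> int:
    """How many tasks fit in RAM after keeping reserve_gb for everything else."""
    return int((ram_gb - reserve_gb) // per_task_gb)


# Four tasks peaked just below 21 GB in total (figure from the post above).
per_task = peak_gb_per_task(21.0, 4)
print(f"per task: {per_task:.2f} GB")
print("fits in 32 GB:", max_tasks(32.0, per_task))
print("fits in 64 GB:", max_tasks(64.0, per_task))
```

Under these assumptions eight tasks would not fit in 32 GB but would fit in 64 GB. Whether throughput actually improves with more tasks is a separate question that this sketch does not address.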

Edit: 10% complete in one hour ten minutes, but BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out.
ID: 66551
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66552 - Posted: 23 Nov 2022, 12:16:22 UTC - in response to Message 66551.  

Edit: 10% complete in one hour ten minutes, but BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out.
Boinc never predicts time correctly. My primegrid tasks say 100 days when I know they'll take 6. Despite having done hundreds of them, Boinc never learns. Surely credit is based on flops, not Boinc estimates?
ID: 66552
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66553 - Posted: 23 Nov 2022, 13:12:06 UTC - in response to Message 66552.  

Surely credit is based on flops, not Boinc estimates?
It is for well-behaved projects, but projects are free to do their own thing and some do. However, I am guessing that the data BOINC uses to work out an estimate is somehow related to the amount of computation, i.e. flops.

Depending on the project, different tasks even from the same batch might sometimes generate wildly differing amounts of computation, especially where the result of one computation dictates the data that goes into the next. My experience, though, is that there is very little variation in computation time between tasks from a particular batch for CPDN.
ID: 66553
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,698,338
RAC: 10,100
Message 66554 - Posted: 23 Nov 2022, 14:04:29 UTC - in response to Message 66553.  

BOINC (generically) has used varying credit schemes over the years in the central code made available to projects - we're now on to the third, "CreditNew" from 2010. The generic code is indeed based on the size of a task (fpops) and the speed of processing (flops per second).

But the generic code is just a starting point for any particular project to use, or not, as they please. CPDN, in particular, bases its credit awards - both total and RAC - on trickles, and in particular on the timestep reported in the latest trickle reported to the server. As Dave rightly says, the amount of CPU processing needed for each timestep is pretty constant, so the timestep is a reasonable surrogate for fpops. The only variable is the ratio between fpops and trickles, and that varies between the different model types and the setups for each batch.
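A minimal sketch of that idea, assuming credit is simply the timestep in the latest trickle multiplied by a per-model factor. The model names and multiplier values below are invented for illustration; the real table and SQL are not shown in this thread.

```python
# Hypothetical credit-per-timestep multipliers, one per model type.
# These values are invented for illustration; they are not CPDN's real figures.
CREDIT_PER_TIMESTEP = {
    "model_a": 0.05,
    "model_b": 0.50,
}


def credit_from_trickle(model_type: str, timestep: int) -> float:
    """Total credit for a task, given the timestep in its latest trickle."""
    return CREDIT_PER_TIMESTEP[model_type] * timestep


# Two tasks of the same type at the same timestep earn the same credit,
# regardless of how fast the host is.
print(credit_from_trickle("model_b", 1000))
```

The point of the design is that the host's speed never enters the calculation: only the model type and how far the model has progressed.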

Many years ago, we hit a problem where the RAC figures for a new model were completely out of step with the total credit awarded - and the staff then in post couldn't work out why. I volunteered to look into the calculations for an explanation. A deep dive into my PM inbox has given me a name (Milo Thurston) and a date (5 Dec 2007), but the rest of the conversation is lost. It may exist in the archives of the moderators' email list.

From memory, the first suspicion fell on the code (SQL queries) which inserted credit and RAC into the database. I was able to reproduce that locally, and confirm that it worked - including the decay in RAC over time. A clean bill of health there, so attention switched to the data. Milo sent me a small table which defined the processing, credit, and RAC multipliers for each model type. The RAC figure for the new model had been copied from another model type, and was inappropriate. Voila.
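For readers unfamiliar with the RAC decay mentioned above, a minimal sketch: RAC is an exponentially decaying average, so with no new credit it halves once per half-life. The half-life is assumed here to be one week, the value used in BOINC's generic code; whether the CPDN SQL uses the same constant is an assumption. This is not the SQL described in the post.

```python
HALF_LIFE_DAYS = 7.0  # assumption: one-week half-life, as in BOINC's generic code


def decayed_rac(rac: float, idle_days: float,
                half_life_days: float = HALF_LIFE_DAYS) -> float:
    """RAC after idle_days during which no new credit is granted."""
    return rac * 0.5 ** (idle_days / half_life_days)


print(decayed_rac(10_000.0, 7.0))   # after one half-life
print(decayed_rac(10_000.0, 14.0))  # after two half-lives
```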

There was some other internal reason for keeping that figure, so Milo introduced a fiddle factor for the new type into the SQL query - it's probably still there to this day.
ID: 66554
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66555 - Posted: 23 Nov 2022, 22:30:55 UTC - in response to Message 66551.  

BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out.
Ignore the estimated time. That relies on calculations by boinc with hints from the app. My understanding is it doesn't get that right until it's seen the app enough times. I turn that column off.
Credit is a hand-wavy number guesstimated by Andy for OpenIFS to give similar numbers to the Hadley models. I know this because I asked him! Nothing more clever than that.
The fraction done reported is now accurate, I fixed that bug so it's easy to work out expected finish yourself.
I was getting 10% done in just under 1 hr with 100% CPU, i7-11700. But then I did compile the model on the same machine :)
Anyway, need to check results but looking good for production release.
ID: 66555
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66556 - Posted: 23 Nov 2022, 22:41:47 UTC - in response to Message 66555.  

Ignore the estimated time. That relies on calculations by boinc with hints from the app. My understanding is it doesn't get that right until it's seen the app enough times. I turn that column off.

....

The fraction done reported is now accurate, I fixed that bug so it's easy to work out expected finish yourself.
I use <fraction_done_exact> in app_config so I get sensible estimates. Not sure why Boinc doesn't just do this all the time. It simply takes the % done and the time taken and multiplies it up to 100%.
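That extrapolation can be sketched as follows. This illustrates the arithmetic only and is not BOINC's actual code; the function name is hypothetical.

```python
def estimate_remaining_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Linear extrapolation: total = elapsed / fraction_done, remaining = total - elapsed."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_seconds / fraction_done
    return total - elapsed_seconds


# Example from earlier in the thread: 10% complete after 1 hour 10 minutes.
elapsed = 70 * 60
remaining = estimate_remaining_seconds(elapsed, 0.10)
print(f"about {remaining / 3600:.1f} hours to go")
```

With those numbers the sketch gives roughly ten and a half hours remaining, far below the four-day estimate. It is only as good as the fraction done reported by the app.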
ID: 66556
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66557 - Posted: 24 Nov 2022, 6:00:02 UTC

Between 11 hrs 4 minutes and 11 hrs 20 minutes here. The fastest one was running on its own for 3/4 of the time, but a larger sample will be needed to know whether this is random or due to running on its own. I will be sticking to running just two of these at a time when they hit the main site, as any more than that produces data faster than it can upload.
ID: 66557
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66567 - Posted: 24 Nov 2022, 16:20:26 UTC

Testing of the oifs_43r3_ps app was successful. A small test will go to the production site to test the boinc server config, as this is the first time the new app has gone out.

If that's successful the batches of 1000 workunits will then go out shortly after.

The oifs_43r3_bl app is not far behind with testing.
ID: 66567
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66579 - Posted: 25 Nov 2022, 12:19:43 UTC

Small test of app oifs_43r3_ps now on production server.

Large batches are expected to go out Monday 28th.
ID: 66579
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66583 - Posted: 25 Nov 2022, 14:39:56 UTC - in response to Message 66579.  
Last modified: 25 Nov 2022, 14:51:10 UTC

I see two batches, one of 2 tasks and one of 5, showing on the page where moderators can check details of batches. Nothing is showing as ready to send or in progress on the server status page at the moment, but that is not unusual as it is only updated about every 2 hours or so.

Edit: Unless these have been restricted to go to particular machines, the vagaries of who gets them might mean a long wait till they come back.

Edit 2: Now showing on the server status page as ready to go, which means they have probably gone.
ID: 66583
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 30,967,615
RAC: 14,422
Message 66585 - Posted: 25 Nov 2022, 17:38:48 UTC - in response to Message 66583.  

Showing as in progress
ID: 66585
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66586 - Posted: 26 Nov 2022, 0:31:43 UTC - in response to Message 66585.  
Last modified: 26 Nov 2022, 0:38:56 UTC

Showing as in progress
That doesn't mean much. It just means it's on a volunteer machine, either 'ready to start', 'running' or 'suspended'. None of the sent jobs have produced any trickles, which tells me they are not actually running yet. Probably sitting behind a bunch of HadSM4 jobs. I've asked for the workunit deadline to be set to 30 days for these OpenIFS tasks to get around the problem of volunteers who set a large job cache.

If the test jobs are not back Monday, I'll speak with Andy and if necessary we'll send another test batch to targeted machines (i.e. mine) which will return quickly.
ID: 66586
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4536
Credit: 18,993,249
RAC: 21,753
Message 66587 - Posted: 26 Nov 2022, 6:54:44 UTC - in response to Message 66586.  

That doesn't mean much. It just means it's on a volunteer machine, either 'ready to start', 'running' or 'suspended'. None of the sent jobs have produced any trickles, which tells me they are not actually running yet. Probably sitting behind a bunch of HadSM4 jobs. I've asked for the workunit deadline to be set to 30 days for these OpenIFS tasks to get around the problem of volunteers who set a large job cache.
Two of the batch of five now showing as completed.
ID: 66587
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66588 - Posted: 26 Nov 2022, 11:13:40 UTC - in response to Message 66587.  

Two of the batch of five now showing as completed.
One failed so far. Looks like it failed right at the end so probably something went wrong with the final zip of output files which I've seen happen before. Will look more closely at the returned output Monday, though it's too late to implement any boinc wrapper code changes now.
ID: 66588
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66606 - Posted: 28 Nov 2022, 20:20:06 UTC
Last modified: 28 Nov 2022, 20:23:35 UTC

The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps

3000 in this batch. Many more to follow.

Typical runtime is ~7-10hrs depending on your CPU. Memory should be ~6GB.

P.S. For those who like to see how the model is progressing: go into the correct slot directory and run 'tail ifs.stat'. The 4th column is the current model step count, the 5th is the CPU time for that step. Total number of steps is 2952. On no account edit ifs.stat (or any of the files in that slot directory); it risks breaking the task.
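For anyone who would rather script that check, here is a minimal read-only sketch. It assumes each line of ifs.stat is whitespace-separated with the step count in the 4th column, as described above. The sample line is invented for illustration and is not real model output, and the function names are hypothetical.

```python
TOTAL_STEPS = 2952  # total number of steps quoted above for this batch


def step_from_line(line: str) -> int:
    """Return the model step count from the 4th whitespace-separated column."""
    return int(line.split()[3])


def percent_done(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Progress as a percentage of the total step count."""
    return 100.0 * step / total_steps


def last_line(path: str) -> str:
    """Return the last non-empty line of a file, opened read-only."""
    with open(path, "r") as fh:
        lines = [ln for ln in fh if ln.strip()]
    return lines[-1]


# Invented sample line (not real output); only the 4th column is used here.
sample = "12:00:00 A X 1476 3.2"
print(f"{percent_done(step_from_line(sample)):.1f}% done")
```

In a slot directory this would be used as percent_done(step_from_line(last_line("ifs.stat"))). The file is only ever opened for reading, in keeping with the warning above.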
ID: 66606
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66607 - Posted: 28 Nov 2022, 21:01:38 UTC - in response to Message 66606.  

The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps

3000 in this batch. Many more to follow.

Typical runtime is ~7-10hrs depending on your CPU. Memory should be ~6GB.


That is the name. Another name is /var/lib/boinc/slots/11/./master.exe which is pretty funny because my Linux machine will not really run .exe files. However, it looks really OK:

[/var/lib/boinc/slots/10]# file master.exe 
master.exe: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=c3f8ea54db10edfe769adb8096efc92f023410a8, for GNU/Linux 3.2.0, stripped


My Linux kernel is 4.18.0-372.26.1.el8_6.x86_64

My two take 2.5 and 3.5 GBytes working set but the amounts jump around a lot.
Predicted 2 days 18 hours to go, having done about 1 hour 18 minutes each.

Deadline Wednesday 28 December 2022 02:29:34

The processor cache is running pretty well, but the programs do not all fit in it, so about half the references have to hit the RAM.
Memory 	62.28 GB
Cache 	16896 KB

# perf stat -aB -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    20,162,191,759      cache-references                                            
     9,499,792,797      cache-misses              #   47.117 % of all cache refs    

      63.876800684 seconds time elapsed


There is no swapping to disk, even though there are 10 other Boinc tasks running.
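The miss percentage in the perf output above is just cache-misses divided by cache-references. A minimal sketch of that calculation, using the two counter values copied from the output (the function name is hypothetical):

```python
def cache_miss_percent(cache_references: int, cache_misses: int) -> float:
    """Percentage of cache references that missed."""
    if cache_references <= 0:
        raise ValueError("cache_references must be positive")
    return 100.0 * cache_misses / cache_references


# Counter values copied from the perf stat output above.
refs = 20_162_191_759
misses = 9_499_792_797
print(f"{cache_miss_percent(refs, misses):.3f} % of all cache refs")
```

These counters are system wide, so they cover every running process and not only the OpenIFS tasks.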
ID: 66607

©2024 cpdn.org