Message boards : Number crunching : New work discussion - 2
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"But not automatically within Boinc yet presumably." "Correct. In projects such as asteroids, there are volunteers helping code OpenCL for GPUs, for example. Have you obtained the services of us users for here too? (Not myself, but I'm sure there are many)." For getting Open Box to run automagically in a shorter time frame, if volunteers were to come forward with the requisite skills, I suspect they would be welcomed. Volunteers don't get around the license conditions of the code used, again something that has been said many times. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"For getting Open Box to run automagically in a shorter time frame, if volunteers were to come forward with the requisite skills, I suspect they would be welcomed. Volunteers don't get around the license conditions of the code used, again something that has been said many times." Not sure what you mean by that: your first sentence sounds like yes, your second sounds like they wouldn't be allowed. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
"Volunteers don't get around the license conditions of the code used, again something that has been said many times." There would have to be a formal association with U. Oxford in some form; then volunteers would be covered by the university licensing (as I am), at least for the model software. The boinc control code for the models is all open source. But what we're talking about here is implementation of binary executables in a new container environment, and that doesn't need a source code license. The binary that CPDN volunteers run is covered by a 'binary only' license (if you ever look in the slot directory, you will see the PDF). |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
Final dev tests are running now for the Perturbed Surface (ps) variant of Linux OpenIFS (Dave has 3 & I've got 1). All looks ok. Currently preparing the files for the big release on the production site. Plan is to release batches of ~1000 workunits each, starting next week. There will be a total of 42 of these batches to go out, with the possibility of another 40-odd if the scientist wants. The Baroclinic Lifecycle (bl) variant of the model, under a separate project, is also nearing final testing. That will not be releasing as many tasks as the ps one. I've created a new thread 'OpenIFS models discussion' under Number Crunching to spare this thread from discussion about the models now that they will be appearing in large numbers. There is also the OpenIFS FAQ, which I will keep updated periodically and which I hope moderators can direct questions to. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
In fact, I seem to have four running. Total memory use on my system seems to peak just below 21GB out of 32 with four, so that is fine. Running 8 at once should probably wait till I have my upgrade to 64GB, though, and even then restricting my box to 7 out of 8 real cores may well give a higher throughput. Edit: 10% complete in one hour ten minutes, but BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Edit: 10% complete in one hour ten minutes, but BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out." Boinc never predicts time correctly. My PrimeGrid tasks say 100 days when I know they'll take 6. Despite having done hundreds of them, Boinc never learns. Surely credit is based on flops, not Boinc estimates? |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"Surely credit is based on flops, not Boinc estimates?" It is for well-behaved projects, but projects are free to do their own thing and some do. However, I am guessing that the data BOINC uses to work out an estimate is somehow related to the amount of computation, i.e. flops. Depending on the project, different tasks even from the same batch might sometimes generate wildly differing amounts of computation, especially where the result of one computation dictates the data that goes into the next. My experience with CPDN, though, is that there is very little variation in computation time between tasks from a particular batch. |
Joined: 1 Jan 07 Posts: 1061 Credit: 36,714,904 RAC: 8,478 |
BOINC (generically) has used varying credit schemes over the years in the central code made available to projects - we're now on to the third, "CreditNew" from 2010. The generic code is indeed based on the size of a task (fpops) and the speed of processing (flops, i.e. floating-point operations per second). But the generic code is just a starting point for any particular project to use, or not, as they please. CPDN, in particular, bases its credit awards - both total and RAC - on trickles, and specifically on the timestep reported in the latest trickle sent to the server. As Dave rightly says, the amount of CPU processing needed for each timestep is pretty constant, so the timestep is a reasonable surrogate for fpops. The only variable is the ratio between fpops and trickles, and that varies between the different model types and the setups for each batch. Many years ago, we hit a problem where the RAC figures for a new model were completely out of step with the total credit awarded - and the staff then in post couldn't work out why. I volunteered to look into the calculations for an explanation. A deep dive into my PM inbox has given me a name (Milo Thurston) and a date (5 Dec 2007), but the rest of the conversation is lost. It may exist in the archives of the moderators' email list. From memory, the first suspicion fell on the code (SQL queries) which inserted credit and RAC into the database. I was able to reproduce that locally, and confirm that it worked - including the decay in RAC over time. A clean bill of health there, so attention switched to the data. Milo sent me a small table which defined the processing, credit, and RAC multipliers for each model type. The RAC figure for the new model had been copied from another model type, and was inappropriate. Voilà. There was some other internal reason for keeping that figure, so Milo introduced a fiddle factor for the new type into the SQL query - it's probably still there to this day. |
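A minimal sketch of the timestep-based scheme described in the post above, not CPDN's actual SQL. The `ModelMultipliers` table, its field names and its values are hypothetical, and the one-week half-life for RAC decay is an assumption borrowed from BOINC's generic averaging; the project's own queries may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMultipliers:
    """Hypothetical per-model-type multipliers (names and values are illustrative)."""
    credit_per_timestep: float
    rac_per_timestep: float


# Assumption: exponential decay with a one-week half-life.
HALF_LIFE_DAYS = 7.0


def credit_for_trickle(prev_timestep: int, new_timestep: int,
                       m: ModelMultipliers) -> float:
    """Credit earned between two trickles, proportional to timesteps completed."""
    return (new_timestep - prev_timestep) * m.credit_per_timestep


def decay_rac(rac: float, days_elapsed: float,
              half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Decay a recent-average-credit figure over the elapsed time."""
    return rac * 0.5 ** (days_elapsed / half_life_days)


# Illustrative values only: a model type whose RAC multiplier was copied from
# another type drifts out of step with its total credit.
correct = ModelMultipliers(credit_per_timestep=0.5, rac_per_timestep=0.5)
copied = ModelMultipliers(credit_per_timestep=0.5, rac_per_timestep=2.0)
steps = 300 - 100
print(credit_for_trickle(100, 300, correct))           # total credit for 200 steps
print(steps * copied.rac_per_timestep)                 # RAC contribution with the wrong multiplier
print(decay_rac(1000.0, 7.0))                          # RAC after one half-life
```

With the copied multiplier the RAC contribution is four times the credit awarded for the same timesteps, which is the kind of mismatch the post describes.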
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
"BOINC is estimating over four days to completion. If the credits are based on the same data as the estimated time, then that will need to be changed, though estimated time on new model types is often wildly out." Ignore the estimated time. That relies on calculations by boinc with hints from the app. My understanding is it doesn't get that right until it's seen the app enough times. I turn that column off. Credit is a hand-wavy number guesstimated by Andy for openifs to give similar numbers to the Hadley models. I know this because I asked him! Nothing more clever than that. The fraction done reported is now accurate (I fixed that bug), so it's easy to work out the expected finish yourself. I was getting 10% done in just under 1 hr with 100% CPU, i7-11700. But then I did compile the model on the same machine :) Anyway, I need to check the results, but it's looking good for production release. |
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918 |
"Ignore the estimated time. That relies on calculations by boinc with hints from the app. My understanding is it doesn't get that right until it's seen the app enough times. I turn that column off." I use <fraction_done_exact> in app_config so I get sensible estimates. Not sure why Boinc doesn't just do this all the time. It simply takes the % done and the time taken so far and scales that up to 100%. |
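The extrapolation described in the post above can be sketched as follows. This is only an illustration of "take the % done and the time taken and scale up to 100%", not BOINC's client code; the function name is made up for the example.

```python
def estimate_remaining_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Linear extrapolation: total = elapsed / fraction_done, remaining = total - elapsed."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_seconds = elapsed_seconds / fraction_done
    return total_seconds - elapsed_seconds


# Figures quoted earlier in the thread: 10% complete after 1 hour 10 minutes.
elapsed = 70 * 60
remaining = estimate_remaining_seconds(elapsed, 0.10)
print(f"about {remaining / 3600:.1f} hours remaining")
```

For those figures the sketch gives roughly ten and a half hours remaining, far closer to the reported run times than the four-day estimate.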
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Between 11 hrs 4 mins and 11 hrs 20 mins here. The fastest one was running on its own for 3/4 of the time, but a larger sample will be needed to know whether this is random or due to running on its own. I will be sticking to running just two of these at a time when they hit the main site, as any more than that produces data faster than it can upload. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
Testing of the oifs_43r3_ps app was successful. A small test will go to the production site to test the boinc server config, as this is the first time the new app has gone out. If that's successful, the batches of 1000 workunits will go out shortly after. The oifs_43r3_bl app is not far behind with testing. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
Small test of app oifs_43r3_ps now on production server. Large batches are expected to go out Monday 28th. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I see two batches, one of 2 tasks and one of 5, showing on the page where moderators can check batch details. Nothing is showing as ready to send or in progress on the server status page at the moment, but that is not unusual as it is only updated about every 2 hours or so. Edit: Unless these have been restricted to go to particular machines, the vagaries of who gets them might mean a long wait till they come back. Edit 2: Now showing on the server status page as ready to go, which means they have probably gone. |
Joined: 22 Feb 06 Posts: 491 Credit: 31,020,649 RAC: 14,464 |
Showing as in progress |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
"Showing as in progress" That doesn't mean much. It just means it's on a volunteer machine, either 'ready to start', 'running' or 'suspended'. None of the sent jobs have produced any trickles, which tells me they are not actually running yet. Probably sitting behind a bunch of HadSM4 jobs. I've asked for the workunit deadline to be set to 30 days for these OpenIFS tasks to get around the problem of volunteers who set a large job cache. If the test jobs are not back Monday, I'll speak with Andy and if necessary we'll send another test batch to targeted machines (i.e. mine) which will return quickly. |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"That doesn't mean much. It just means it's on a volunteer machine, either 'ready to start', 'running' or 'suspended'. None of the sent jobs have produced any trickles, which tells me they are not actually running yet. Probably sitting behind a bunch of HadSM4 jobs. I've asked for the workunit deadline to be set to 30 days for these OpenIFS tasks to get around the problem of volunteers who set a large job cache." Two of the batch of five now showing as completed. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
"Two of the batch of five now showing as completed." One failed so far. Looks like it failed right at the end, so probably something went wrong with the final zip of output files, which I've seen happen before. Will look more closely at the returned output Monday, though it's too late to implement any boinc wrapper code changes now. |
Joined: 29 Oct 17 Posts: 1049 Credit: 16,475,631 RAC: 16,075 |
The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps. 3000 in this batch. Many more to follow. Typical runtime is ~7-10 hrs depending on your CPU. Memory should be ~6 GB. P.S. For those who like to see how the model is progressing: go into the correct slot directory and then run 'tail ifs.stat'. The 4th column is the current model step count, the 5th is the cpu time for that step. Total number of steps is 2952. On no account edit ifs.stat (or any of the other files in that slot directory), as that risks breaking the task. |
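For anyone who would rather see a percentage than a raw step count, here is a small read-only sketch based on the description above. It assumes ifs.stat is whitespace-separated with the step count in the 4th column and the per-step CPU time in the 5th, as the post states; the sample line is made up, and the total of 2952 steps applies to this batch only. The file is opened for reading only and never modified.

```python
from pathlib import Path

TOTAL_STEPS = 2952  # total step count quoted in the post for this batch


def last_line(stat_file: Path) -> str:
    """Return the last non-empty line of ifs.stat (the file is only read)."""
    lines = [ln for ln in stat_file.read_text().splitlines() if ln.strip()]
    return lines[-1]


def parse_step(stat_line: str) -> tuple[int, float]:
    """Extract (step count, CPU time for that step) from columns 4 and 5."""
    fields = stat_line.split()
    return int(fields[3]), float(fields[4])


def progress_percent(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Percentage of model steps completed."""
    return 100.0 * step / total_steps


# Hypothetical sample line: only columns 4 and 5 follow the layout in the post.
sample = "col1 col2 col3 1476 1.25"
step, cpu_time = parse_step(sample)
print(f"step {step} of {TOTAL_STEPS}: {progress_percent(step):.1f}% done")
```

On a real task you would pass the slot's file instead of the sample, for example `parse_step(last_line(Path("ifs.stat")))` from inside the slot directory.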
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps" That is the name. Another name is /var/lib/boinc/slots/11/./master.exe, which is pretty funny because my Linux machine will not really run .exe files. However, it looks OK: [/var/lib/boinc/slots/10]# file master.exe reports: master.exe: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=c3f8ea54db10edfe769adb8096efc92f023410a8, for GNU/Linux 3.2.0, stripped. My Linux kernel is 4.18.0-372.26.1.el8_6.x86_64. My two tasks take 2.5 and 3.5 GBytes working set, but the amounts jump around a lot. Predicted 2 days 18 hours to go, having done about 1 hour 18 minutes each. Deadline: Wednesday 28 December 2022 02:29:34. The processor cache is running pretty well, but the programs do not all fit in it, so about half the references have to hit the RAM. Memory: 62.28 GB. Cache: 16896 KB. # perf stat -aB -e cache-references,cache-misses reports: Performance counter stats for 'system wide': 20,162,191,759 cache-references; 9,499,792,797 cache-misses (47.117 % of all cache refs); 63.876800684 seconds time elapsed. There is no swapping to disk even though there are 10 other Boinc tasks running. |
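The "about half the references have to hit the RAM" figure follows directly from the two perf counters quoted above. A minimal check of that arithmetic, using a helper name invented for this example:

```python
def cache_miss_rate(cache_references: int, cache_misses: int) -> float:
    """Cache misses as a percentage of all cache references."""
    if cache_references <= 0:
        raise ValueError("cache_references must be positive")
    return 100.0 * cache_misses / cache_references


# Counter values copied from the perf stat output in the post.
rate = cache_miss_rate(20_162_191_759, 9_499_792_797)
print(f"{rate:.3f} % of all cache refs")
```

This reproduces the 47.117 % that perf itself printed next to the cache-misses counter.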
©2024 cpdn.org