Message boards : Number crunching : OpenIFS Discussion
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,323,407 RAC: 10,248 |
Thank you for the update, Glenn.

EDIT: PS. I've doubled the Ubuntu VM disc to 200 GB. That should give the VM enough disc headroom for all zips from the task backlog to be kept locally and uploaded at some point.

Update:
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
I'm not sure I can go along with that statement completely. Prior to this most recent upload outage, I was computing 10 tasks simultaneously on two machines, and uploading the intermediate files over a single internet link. And my uploads were fully up to date, with no backlog - I was uploading in real time.

I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

* internet connection speed
* CPU power
* Memory (RAM)
* Disk space

If any single item from that list is below the balance point, the system as a whole will be less than ideally efficient.
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
I did not make this statement after looking at one or two hosts. I looked at recorded server_status.php history. (Sum of 'tasks ready to send' and 'tasks in progress', plotted over time, oifs_43r3_ps only. grafana.kiska.pw has got the record.) In other words, by "we" I don't refer to myself, but to everyone combined who is, or has been, computing oifs_43r3_ps.

We had three modes of progress in January:

– upload11 was down. Progress rate was 0. (We have had four periods of this in January so far.)
– upload11 was online and ran without a notable limit on connections. Progress rate was ~3,300 results in 14 hours, followed by upload11 going down again. (This mode played out twice in January.)
– upload11 was online and ran with a throttled connection limit. Progress rate was fairly constant at ~1,500...~2,000 results/day. (There was a single period in this mode. It lasted 8d3h until the tape storage issue.)

The latter constant progress rate cannot be the rate at which we are actually producing. If it were, there would have been noticeably steeper progress at the start of that stage, when everybody still had previously stuck files to upload.
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
Interesting figures, but I don't think you're comparing like with like. Looking at my own machines:

In 'recovery' mode, with a backlog of files after an outage, I can upload a file on average in about 10 seconds - file after file after file, continuously without a break.

In 'production' mode, uploading files as they're produced, I can generate about one file every minute and a half on average.

So, in very round figures, my production bandwidth needs are roughly 10% of my recovery bandwidth. I think that the server can cope OK with 'production' levels, but struggles with 'recovery' levels. Provided we can raise the MTBF (mean time between failures) to a more comfortable level, we'll be OK.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

I'm going to add more flesh to that list.

Like most computational fluid dynamics codes, OpenIFS moves a lot of data around in memory, so 'memory bandwidth' is really the key to throughput, not just 'memory' nor 'L3 cache size'. Unless you are lucky enough to have an octa-channel EPYC, dual-channel memory motherboards saturate pretty quickly when running multiple OpenIFS tasks, which is why I keep saying don't put 1 task per thread even if you have the RAM space for it (a sketch of how to cap concurrent tasks follows at the end of this post).

For 'CPU power', read 'single-core speed', not 'core count' (we haven't got multicore apps yet). I overclock to get that extra 5-10% if I have the motherboard for it.

Had the upload server been working fine, internet connection should have been less of an issue, even for slower lines as the uploads were designed for a slower connection.

There's one key item missing from that list - human resources. CPDN gets by with a skeleton crew in Oxford. It would not survive without the support of volunteers and all the help from forum folk.
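For anyone wanting to follow that advice, one way to cap how many OpenIFS tasks run at once, whatever your thread count, is an app_config.xml in the CPDN project folder under the BOINC data directory. A minimal sketch, assuming the app's short name matches the oifs_43r3_ps name used in this thread (check client_state.xml on your host for the exact name) and picking 4 as an arbitrary example limit:

    <app_config>
        <app>
            <name>oifs_43r3_ps</name>
            <!-- run at most 4 of these tasks at once, even if more threads are free -->
            <max_concurrent>4</max_concurrent>
        </app>
        <!-- optional: cap all tasks from this project together -->
        <project_max_concurrent>6</project_max_concurrent>
    </app_config>

After saving it, use 'Options -> Read config files' in BOINC Manager (or restart the client) for it to take effect.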
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
Reminder to reset the <ncpus> tag in cc_config.xml if you changed it

If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks, too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863. It would save CPDN trawling through their database to find these hosts and contact their owners. Thanks!
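For reference, a minimal sketch of cc_config.xml (in the BOINC data directory) with the tag back at its default; any other options you already have stay alongside it:

    <cc_config>
        <options>
            <!-- -1 = use the number of CPUs/threads BOINC detects on this host -->
            <ncpus>-1</ncpus>
        </options>
    </cc_config>

Then 'Options -> Read config files' in BOINC Manager, or a client restart, picks up the change.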
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
internet connection should have been less of an issue, even for slower lines as the uploads were designed for a slower connection.

For those of us with very slow speeds, it is an issue. Mine cannot quite keep up with running two tasks at a time, but I get that my situation is far from the norm now, at least in the UK. I do check on a regular basis when I am due an upgrade, but no hints so far.
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
For those of us with very slow speeds, it is an issue.

Understood. Like you, I'm already at the maximum speed easily and affordably available in my location - luckily, BT reached me before it reached you (though it did take them a long while to work out how to cross the canal). Getting anything faster would involve moving house to a new location, and I don't think either of us is likely to consider doing that for BOINC!
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
@Richard Haselgrove, note that CPDN's overall oifs_43r3_ps progress _right now_ is likely not subject to any of the three modes which I described, because a few things were apparently changed after the tape storage disaster.
________
Glenn Carver wrote:
If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks, too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

This host (it is not one of mine) had ncpus set to 100 when I looked at this link just now. This *may* have been done out of a desire to download new work while lots of uploads were pending. (Fetching new work in such a situation is risky, though, given the history of upload11's operations.)

However, there is also another possible explanation for why the user did this: the BOINC client and its control interfaces (web control, boincmgr, global_prefs_override.xml, you name it) only offer to control the number of CPUs usable by BOINC as a percentage, not as an absolute number of CPUs. Hence some users apply this simple trick: set <ncpus> to 100, et voilà, <max_ncpus_pct> suddenly becomes equal to the absolute number of CPUs which BOINC shall use (sketched below).

So if you see such hosts and wonder if their operator is doing something silly or undesirable: It's very well possible that this host is in fact configured well and proper. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
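For clarity, a sketch of that trick using the standard file names; the numbers are made-up examples. With <ncpus> pinned at 100, the percentage preference reads directly as a thread count:

    cc_config.xml:
    <cc_config>
        <options>
            <!-- make BOINC believe there are 100 CPUs, so each 1% below = 1 thread -->
            <ncpus>100</ncpus>
        </options>
    </cc_config>

    global_prefs_override.xml (or the equivalent web/boincmgr preference):
    <global_preferences>
        <!-- with the override above, 12% means "use at most 12 threads" -->
        <max_ncpus_pct>12</max_ncpus_pct>
    </global_preferences>

The side effect is that such a host then shows up with 100 processors on the project website, which is presumably what prompted the concern above.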
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,696,681 RAC: 10,226 |
We need also to consider, and check, how a <max_concurrent> in app_config.xml is reported back to the server - probably not at all, because at work fetch time the emphasis is on allocation, rather than progress.

I did point out to Glenn privately (yesterday) that CPDN has the data from the regular BOINC trickles available on the scheduling server. It would take some effort, but they could yield precise information on the number of tasks actually being processed at a given time on a given host - and that data continues to be transferred even when the upload server is baulked.
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Interestingly, I had a testing OIFS perturbed surface task that got stuck at 99.990% for over an hour today. I copied the slot directory in case any information might prove useful, but after stopping and restarting BOINC the task completed successfully. I have no idea what the issue was.
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
About next OpenIFS batches: one or another frequenter of this board has mentioned it already: consider increasing the "max # of error tasks" workunit parameter (and the total tasks, of course). 3, as in the current WUs, isn't a lot. (Too high a "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem, at least currently.)
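For context, under the standard BOINC server scheme these limits are per-workunit attributes, set in the workunit input template or via create_work flags; the following is a hypothetical sketch with made-up numbers, not CPDN's actual template:

    <workunit>
        <!-- give up on the workunit only after 5 errored results (currently 3) -->
        <max_error_results>5</max_error_results>
        <!-- total results ever created for the workunit; raised to match -->
        <max_total_results>8</max_total_results>
    </workunit>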
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
So if you see such hosts and wonder if their operator is doing something silly or undesirable: It's very well possible that this host is in fact configured well and proper. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)

I appreciate that, I also find %age cpus a pain (why wasn't it just a plain number?). But there are other cases where that's not the case.

About next OpenIFS batches:

It's staying at 3. Some of the model perturbations lead to the model aborting, negative theta, levels crossing, too short timestep etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles and if necessary another small batch to rerun can be sent out.
Send message Joined: 5 Aug 04 Posts: 178 Credit: 18,742,023 RAC: 51,698 |
I appreciate that, I also find %age cpus a pain (why wasn't it just a plain number?). But there are other cases where that's not the case.

Nope, having only a plain number, but boxes with different Core-Counts, it is a real pain. I have all my boxes set to "Use only 75% of the real existing Cores", and I really want exactly this behaviour as a maximum for BOINC.

Supporting BOINC, a great concept!
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,989,107 RAC: 21,788 |
Nope, having only a plain number, but boxes with different Core-Counts, it is a real pain.

One of the many issues where those who write the code are never going to please everyone. I personally would have gone for a plain number, but it isn't a biggie. Currently I only have one machine and, unless dementia sets in, working out what percentage I need for a particular number of cores isn't arduous - but then doing it the other way around wouldn't be either!
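To make the conversion concrete: the percentage lives in the computing preferences (global_prefs_override.xml if set locally), and a hypothetical 16-thread box limited to 12 threads needs 12/16 x 100 = 75%:

    <global_preferences>
        <!-- on a 16-thread machine, 75% means BOINC uses at most 12 threads -->
        <max_ncpus_pct>75</max_ncpus_pct>
    </global_preferences>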
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
Forthcoming batches

Just out of a meeting this morning, 30/1/23. There will be some 6,500 workunits coming for the OpenIFS Baroclinic Lifecycle app (oifs_43r3_bl) for an experiment run by the University of Helsinki, hopefully in 2 weeks' time. They will go out as soon as we complete testing on some code changes to fix issues thrown up by the last batches (so we should see fewer task failures).

These runs will be shorter, with runtimes ~half those of the PS OpenIFS app (YMMV). Further scientific & technical details, as requested by forum folk, are being prepared and will be made available soon.

Expect less total I/O and smaller upload sizes, as the runs are shorter. The memory requirement will be the same, as the model resolution is unchanged.
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Glenn Carver wrote:
xii5ku wrote:
About next OpenIFS batches:
It's staying at 3. Some of the model perturbations lead to the model aborting, negative theta, levels crossing, too short timestep etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles and if necessary another small batch to rerun can be sent out.

Thanks, sounds good! If it's feasible to filter out such triple-errors which were not repeats of one and the same reproducible model failure, and turn these back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,431,665 RAC: 17,512 |
xii5ku wrote:
Glenn Carver wrote:
Consider increasing the "max # of error tasks" workunit parameter (and the total tasks, of course). 3, as in the current WUs, isn't a lot.
Some of the model perturbations lead to the model aborting, negative theta, levels crossing, etc. 3 is usually enough to get past any wobbles, if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out such triple-errors which were not repeats of one and the same reproducible model failure, and turn these back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.

It's not possible to 'filter out' the triple-errors (if I understand what you mean). We don't know a priori what the model will do with applied perturbations until it's run. Also, we can get different answers from identical runs on different hardware, so a run that fails on an old Intel chip (for example) might work on a newer AMD machine. I have seen examples like this from the recent batches, though I can't show you an example.

One day, when I get more time, I intend to look at the model perturbations coming from running identical tasks across the range of different hardware connected to CPDN. This was done in the early days of CPDN when they ran the very long, big-batch climate simulations.
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Also, we can get different answers from identical runs on different hardware, so a run that fails on an old Intel chip (for example) might work on a newer AMD machine. I have seen examples like this from the recent batches, though I can't show you an example.

I have to agree. I have received tasks that had failed for 4 previous users, but completed successfully on my machine. With these OIFS tasks, I have had no trouble at all. In recent memory, I have been #3 in a bunch of tasks and completed successfully. And usually the ones before me died of different problems.
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,803,682 RAC: 19,762 |
Forthcoming batches

Aren't there still at least 12,000 new tasks to be processed from the current run by the end of February? I believe that was the number when the sending out of new work was turned off a week or so ago. Any idea how close things are to it being turned back on?