Message boards :
Number crunching :
OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818 |
I think what's happened is CPDN have not set the memory usage limit high enough and depending on what process does what when, it can blow past the limit. It's a working theory I want them to test. I have had one of those failures: 06:37:27 STEP 2509 H=2509:00 +CPU= 16.937 06:37:44 STEP 2510 H=2510:00 +CPU= 16.658 06:38:11 STEP 2511 H=2511:00 +CPU= 24.246 Suspend request received from the BOINC client, suspending the child process double free or corruption (out) So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine as this one has 32GB and only one task was running at the time. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606 |
So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine as this one has 32GB and only one task was running at the time. It may be an incorrect assumption, but I am presuming that the client either puts the processes in a 'sandbox' (chroot to a slot & restricts memory), or it's killing the process because it exceeds the memory limit, but then I would expect to see a message in the log that it's done that. Anyway, the limits are wrong so let's try the low-hanging fruit first before we try other things on volunteer machines. I'll be doing more testing on my machine in the meantime. A lot of the failed tasks with double free happened right after the trickle files were zipped so I was beginning to suspect that was a clue, but further checking showed that's not as common as I thought. Unfortunately there is not enough information coming back from the controlling wrapper when something goes wrong - something else I hope they will change. I think I've also convinced Andy that we needed to do a more realistic batch test on the dev site, a much bigger batch with more volunteers, to test it as it would go out on the production site. We could have picked up these problems earlier had that been done. Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449 |
What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here? |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site. Would that invite be in our Inbox? Or some other way? I assume those invited would be given instructions on how to participate. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818 |
Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site. Yes, it would come via in-box with instructions on how to join the dev site - It will show up as another project cpdn_boinc once anyone invited has joined. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606 |
What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here? Absolutely. No need to run for the full 3 months. I'm more interested in capturing the way volunteers run the tasks on their machine (stuffed to the limit in some cases from what I've read!). I think that's the problem, we haven't tested at the scale we're running on the production site, so the first batch effectively becomes that test. |
Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,406,815 RAC: 15,606 |
Assume we get instructions... It works much the same as climateprediction.net: you get tasks as usual & credit, but they may not always work. |
Send message Joined: 28 Jun 14 Posts: 4 Credit: 8,570,955 RAC: 6 |
I'm getting all sorts of errors here. Been trying to budget 8GB of RAM per OpenIFS workunit. This workunit ran to the end and then aborted? Did BOINC crash? I was running two at a time on this system with 16GB of RAM. https://www.cpdn.org/result.php?resultid=22247140 <message> Process still present 5 min after writing finish file; aborting</message> This one failed with an upload error. Running one at a time since it only has 8GB of RAM. https://www.cpdn.org/result.php?resultid=22246386 <message> upload failure: <file_xfer_error> <file_name>oifs_43r3_ps_1304_2021050100_123_946_12164393_0_r264053712_122.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> This one failed with code 9. Same machine as previous, but this one may have been running more than one for a while before I noticed the OpenIFS tasks were being sent out. https://www.cpdn.org/result.php?resultid=22247027 <message> process exited with code 9 (0x9, -247)</message> double free or corruption (out) This one ran for 15 hours and somehow has no output file? https://www.cpdn.org/result.php?resultid=22245680 Same machine as previous, ran to the end and then had an upload failure. https://www.cpdn.org/result.php?resultid=22245367 <message> upload failure: <file_xfer_error> <file_name>oifs_43r3_ps_0334_2021050100_123_945_12163423_0_r1586639697_122.zip</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> Meanwhile, this old computer has been running two at a time with no trouble. https://www.cpdn.org/show_host_detail.php?hostid=1526772 These are all on ethernet, sharing a switch. Don't think I'm running into bandwidth issues. Checked a couple of the machines for disk usage. BOINC has 100GB to play with, with ~90GB free. Quick edit: Another one just failed. Received this morning on a machine with 8GB of RAM. Running just one workunit. Ran for about 5 hours before failing. "Trickle up message pending" in BOINC manager.
Hasn't been reported to the server yet. No output file in the folder, but there was this progress file, if it helps: https://www.cpdn.org/result.php?resultid=22248845 <?xml version="1.0" encoding="utf-8"?> <running_values> <last_cpu_time>19262.910000</last_cpu_time> <upload_file_number>44</upload_file_number> <last_iter>1059</last_iter> <last_upload>3801600</last_upload> <model_completed>0</model_completed> </running_values> |
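For anyone who wants to read a progress file like the one above without squinting at raw XML, here is a minimal Python sketch. The field meanings are assumptions inferred from the numbers, not taken from CPDN documentation: `last_cpu_time` is treated as CPU seconds and `last_upload` as model time in seconds. The function name `summarise_progress` is made up for illustration. Under those assumptions the values are self-consistent: 3,801,600 s is exactly 44 model days, which matches `upload_file_number` 44, and roughly 5.35 CPU hours matches the "about 5 hours" reported.

```python
import xml.etree.ElementTree as ET

# The progress file quoted in the post above, reproduced verbatim.
PROGRESS_XML = """<?xml version="1.0" encoding="utf-8"?>
<running_values>
  <last_cpu_time>19262.910000</last_cpu_time>
  <upload_file_number>44</upload_file_number>
  <last_iter>1059</last_iter>
  <last_upload>3801600</last_upload>
  <model_completed>0</model_completed>
</running_values>"""


def summarise_progress(xml_text):
    """Return the progress fields as numbers plus two derived values.

    Assumption (not confirmed by the thread): last_cpu_time is in CPU
    seconds and last_upload is model time in seconds.
    """
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    values = {
        "last_cpu_time": float(root.findtext("last_cpu_time")),
        "upload_file_number": int(root.findtext("upload_file_number")),
        "last_iter": int(root.findtext("last_iter")),
        "last_upload": int(root.findtext("last_upload")),
        "model_completed": int(root.findtext("model_completed")),
    }
    values["cpu_hours"] = values["last_cpu_time"] / 3600
    values["last_upload_model_days"] = values["last_upload"] / 86_400
    return values


summary = summarise_progress(PROGRESS_XML)
print(f"CPU hours so far: {summary['cpu_hours']:.2f}")
print(f"Last upload at model day: {summary['last_upload_model_days']:.1f}")
print(f"Model completed: {bool(summary['model_completed'])}")
```

A `model_completed` of 0 together with a missing final zip would be consistent with the model stopping before its last upload was written, which is the explanation given for the `-161 (not found)` upload errors.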
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818 |
The double free or corruption (out) error is a problem with the model or the wrapper code. The failed uploads are because the model has crashed before producing the final upload(s) so they are missing when BOINC tries to upload them to the server once the task has finished. These errors are happening on machines of known good pedigree. Glen is on the case and we may be doing a larger than normal batch over on the testing site to try and resolve this. |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,319,265 RAC: 10,169 |
To get the OpenIFS tasks to run on this VirtualBox Ubuntu host: https://www.cpdn.org/results.php?hostid=1512045, I increased the Ubuntu VM disc partition from 40GB to 100GB (gparted). After five early OpenIFS successes, the subsequent tasks have crashed with one error or another. The event log has reported a lot of 'file absent' records, with no obvious local reason that I can see. This afternoon I've increased the memory allocated to the Ubuntu VM from 28GB to 32GB and reduced CPUs (tasks running) from six to four. On a positive note, after the reboot all the suspended tasks started up successfully! |
Send message Joined: 9 Oct 04 Posts: 82 Credit: 69,918,562 RAC: 8,825 |
Update. Sure this will help! On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it. Understood! Hope fewer trickles might help for smoother uploads. Question for ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with). I have no problem reducing the number of tasks running on my computers to fit into my ADSL bandwidth. I have to remind myself, I offer the scientist a certain amount of compute power, but they have to accept the offer – there are a lot of other worthy BOINC projects! (Hopefully I will remind myself of it when I go out shopping for computer parts for climateprediction.net that I do not need for my personal daily computer requirements!) However, I am still concerned about how many climateprediction.net participants are reading the Forums and how many users are out there who have installed BOINC and attached to climateprediction.net, but never check their machines. You might end up with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never helping science. I will send you a PM with my ADSL speed, so you have a number for the WUs I am likely to contribute each day. It is not much! |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
However, I am still concerned, how many climateprediction.net participants are reading the Forums and how many users are out there, who have installed BOINC and attached to climateprediction.net, but never check their machines. You might end up, with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never help science. I notice, with favor, that these Oifs work units come with about a one-month expiry date instead of a one-year one the traditional work units come with. |
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,319,265 RAC: 10,169 |
ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with). The broadband uplink here is 12Mbps and downlink at 40Mbps. It's pretty consistent at that speed. The event log showed that uploads from six concurrent tasks over the past few days are taking 12-15 seconds each, which is not giving a network headache. A single new task download (3 jf_c... files) is less than two minutes. |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818 |
On bored band here with a max upload speed of about 100KB/s it can just about keep up with 2 tasks running at a time. Not a problem for me as if they do build up I can just cut down to 1 task running till it catches up. With the lower numbers of tasks for testing runs, I sometimes tether my phone to get four times the throughput, but with a 15GB/month limit I won't be doing that for main site batches of these! |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
Glenn Carver wrote: On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it. If the scientist needs 1.72 GB result data per workunit, then that's what I'll be happily producing. After all, it's the data which the scientist desires, not the CPU cycles which produce them. Going by the task properties of those in the first 3000s batch: Based on the CPUs, RAM and disk space which I have available, I could produce >330 results/day = 570 GB/day. If I switched on some older gear and let the flat become uncomfortably warm, it'd be >460 results/day = 790 GB/day.
I have no trouble partitioning my currently running computers such that I produce ≤48 r/d for CPDN and have the rest of computer capacity busy at other projects. |
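The arithmetic in the post above is easy to reproduce, and turning it into a sustained upload rate shows what a given daily output means for a network link. This is a small Python sketch; the only input taken from the thread is the 1.72 GB per result figure. Decimal gigabytes (10^9 bytes) and a perfectly even upload spread over 24 hours are assumptions, and the function names are illustrative.

```python
GB_PER_RESULT = 1.72      # from the post above; decimal GB assumed
SECONDS_PER_DAY = 86_400


def daily_volume_gb(results_per_day, gb_per_result=GB_PER_RESULT):
    """Result data produced per day, in GB."""
    return results_per_day * gb_per_result


def sustained_mbit_per_s(gb_per_day):
    """Average uplink rate needed to ship gb_per_day, in Mbit/s.

    Assumes uploads are spread evenly across the whole day.
    """
    bits_per_day = gb_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


for results in (48, 330, 460):
    volume = daily_volume_gb(results)
    rate = sustained_mbit_per_s(volume)
    print(f"{results:>3} results/day -> {volume:6.1f} GB/day -> {rate:5.1f} Mbit/s")
```

With these assumptions, 330 and 460 results/day come out at about 568 and 791 GB/day, in line with the rounded 570 and 790 quoted, and 48 results/day needs roughly 7.6 Mbit/s of sustained upload.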
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077 |
I'll say one more thing about vboxwrapper, and then I'll stay away from this subject: If you look at LHC@home, the highest producers there run the native Linux ATLAS application, not any of the virtualized applications. And that's no coincidence. One of the reasons is a lot lower RAM requirement by the native application. (Also check out the "average computing" column at apps.php. Or anybody who ever took part in a contest at LHC@home knows very well that the native application is the way to go if computing throughput is of any concern at all.) |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
I've had several OOM, despite Boinc being set to use 90% of system RAM, on dedicated hardware (well, a VM dedicated to BOINC tasks in the winter). https://www.cpdn.org/result.php?resultid=22247094 is one - the rest look identical, just a child task exited. It's a hex-core VM with 12GB RAM - I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple cores out of it for future units, but however the math is happening, OpenIFS tasks are OOMing easily. |
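If the client cannot be relied on to cap the task count by memory, the cap can be worked out by hand. The sketch below is only that: a back-of-envelope Python calculation, not a description of how BOINC schedules work. The 8 GB per-task figure is the budget another poster in this thread mentions, not a measured peak, and the function name is hypothetical.

```python
import math


def max_tasks_by_memory(total_ram_gb, usable_fraction, budget_gb_per_task, cores):
    """Largest task count that fits in both RAM and cores.

    budget_gb_per_task is an assumed per-task allowance (8 GB is the
    budget quoted elsewhere in the thread, not a measured peak).
    """
    usable_gb = total_ram_gb * usable_fraction
    by_memory = math.floor(usable_gb / budget_gb_per_task)
    return max(0, min(cores, by_memory))


# The hex-core VM with 12 GB RAM and BOINC allowed 90% of it.
print(max_tasks_by_memory(12, 0.90, 8.0, 6))  # -> 1
```

On that basis a 12 GB, six-core VM would be limited to a single OpenIFS task at a time, so running one per core would be expected to exhaust memory.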
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Obviously, from the comments in this thread, we have folks here who are bottlenecked by CPU, others by RAM, and others by transfer bandwidth. I think I am bottlenecked by the size of my processor cache. My CPU is pretty fast, 64 GBytes RAM, and I get 75 Megabits per second on my fiber-optic Internet connection. My other computer is a little one running Windows 10, and it spends most of its life doing Boinc, but not OpenIFS. Memory 62.28 GB Cache 16896 KB Swap space 15.62 GB Total disk space 488.04 GB Free Disk Space 477.76 GB Measured floating point speed 6.13 billion ops/sec Measured integer speed 26.09 billion ops/sec Average upload rate 4480.76 KB/sec Average download rate 45235.53 KB/sec (I think the data rates reported by Boinc-CPDN are really Kilobits per second, not KiloBytes per second.) Right now I am running three Oifs tasks, three Rosetta tasks, three WCG tasks, two Einstein tasks, and one (single-processor) MilkyWay task. This shows my machine's cache-miss ratio, so the hit ratio would be 50.45%. Not too bad, but not wonderful either. Other than the 12 boinc processes, the machine is not doing much else at the moment (following my typing into Firefox that is doing nothing else). # perf stat -aB -e cache-references,cache-misses Performance counter stats for 'system wide': 20,626,539,435 cache-references 10,220,773,584 cache-misses # 49.552 % of all cache refs 61.867007273 seconds time elapsed |
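The hit ratio quoted above follows directly from the two `perf stat` counters. Here is a minimal Python sketch that pulls them out of the pasted output and recomputes both ratios. The parsing is a simple regular expression written for this particular output layout, not a general `perf` parser, and the helper name is made up.

```python
import re

# Counter lines as pasted in the post above.
PERF_OUTPUT = """
 20,626,539,435 cache-references
 10,220,773,584 cache-misses # 49.552 % of all cache refs
 61.867007273 seconds time elapsed
"""


def read_counter(text, event):
    """Return the integer count printed before `event` in perf stat output."""
    match = re.search(r"([\d,]+)\s+" + re.escape(event) + r"\b", text)
    if match is None:
        raise ValueError(f"counter {event!r} not found")
    return int(match.group(1).replace(",", ""))


references = read_counter(PERF_OUTPUT, "cache-references")
misses = read_counter(PERF_OUTPUT, "cache-misses")

miss_ratio = misses / references
hit_ratio = 1 - miss_ratio

print(f"miss ratio: {miss_ratio:.3%}")
print(f"hit ratio:  {hit_ratio:.2%}")
```

Note that these are system-wide counters (`-a`), so the ratio reflects everything running on the machine during the sample, not the OpenIFS tasks alone.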
Send message Joined: 4 Oct 19 Posts: 15 Credit: 9,174,915 RAC: 3,722 |
I will happily run tests on the dev server if invited. So far I have 9 tasks that appear to run well - no credit though |
Send message Joined: 15 May 09 Posts: 4535 Credit: 18,983,309 RAC: 21,818 |
It's a hex-core VM with 12GB RAM - I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple cores out of it for future units, but however the math is happening, OpenIFS tasks are OOMing easily. During early stages of these on the testing site, I was able to run 4 tasks on a box that had only 8GB RAM. That laptop is now dead but it did it, albeit at a massive hit on speed because it was swapping to disk every time two or more tasks peaked in memory usage at the same time. There wasn't much of a hit when only running 2 at once. But sadly, the client will not limit how many tasks it will run based on memory. I have the whole of the laptop SSD boot disk that I salvaged as swap on this machine, so 128GB, but am not trying to run 16 or even 8 tasks at once because connection bandwidth is my bottleneck. |